Prediction of RNA secondary structure using genetic algorithm with steepest descent

I.I. Titov, V.A. Ivanisenko, N.A. Kolchanov

Institute of Cytology and Genetics, Siberian Division of the Russian Academy of Sciences.

     Genetic algorithm (GA) is one of the powerful "kinetic" (i.e. based on a limited search) optimization methods. GA employs the evolutionary analogies through iteration of mutations and recombinations in the individual genotypes with phenotypic (fitness) selection in each calculation cycle. Every RNA secondary structure is unambiguously defined by its own set of stems; hence, it is natural to choose the latter as genes for GA-based calculations. Although such a coarse-grained representation is computationally convenient, it leads to a formal paradox: according to the FDC criterion (fitness-distance correlation, Jones and Forrest, 1995), the RNA secondary structure prediction is GA-hard, since the number of structures lacking common stems with the minimal-energy structure is exponentially great. On the other hand, the additivity of energetic rules for RNA secondary structure is a good basis for validity (important for GA efficiency) of building blocks hypothesis; in addition, a number of examples demonstrates successful GA application for secondary structure energy minimization.

     We modified GA to provide the maximal distinction of its dynamics from the routes in statistical FDC-metry. The major peculiarities of our algorithm are:

  1. Recombination symmetry: we employ exchange of the clusters with approximately equal number of stems, whereas the recombinations evenly random in size are conventionally used. Thus, we obtain the progeny that is maximally distinct from the parents, providing a more extensive search.
  2. Local exhaustion instead of stochastic mutations: the conventional role of mutations is to prevent premature GA convergence to a local optimum; instead of mutations, we remove a defined number of random stems and form new stems on the principle of the fastest energy optimization.
  3. Initial diversity and population size: the initial set is created of random individuals, their maximal noncoincidence provided; the set is formed when the stem set is exhausted (each stem occurs at least once).

     The described algorithm possesses a higher efficiency compared to the GA with conventional rules. On IBM-486, it takes our algorithm less than a minute to select the global minimum of RNA free energy in a space of 108 states. Thus, the framing of GA quality criterion in FDC terms in the space of its own operators is essential. However, some complications connected with dynamic character of operatorsí action may arise. For example, the absence of any convergence is observed when recombination frequency is considerable ("error catastrophe").