Fitness - a WWW-resource for RNA folding simulation based on genetic algorithm with local minimization.
I.I. Titov*, V.A. Ivanisenko, N.A. Kolchanov
Institute of Cytology & Genetics SD RAS, Novosibirsk 630090, Russia
*) To whom correspondence should be addressed
Key words: genetic algorithm, RNA folding, secondary structure, antisense oligonucleotide
Abstract
Motivation: Genetic algorithm is a promising method for RNA secondary structure prediction, combining the value of an effective minimization tool, the possibility to search for intermediate RNA folding states and the freedom in folding rules, e.g. an account for pseudo-knots. Existing nowadays genetic algorithms for secondary structure calculation are not fast enough for their use via Internet.
Results: This work presents a web-resource for RNA secondary structure prediction on the basis of genetic algorithm. Optimization speed and quality are obtained through combination of global search and local minimization, as well as by an effective sampling of the structure space with initial population and algorithm parameters optimization. Besides the RNA secondary structure prediction it can simulate its refolding under the influence of an antisense nucleotide. This feature can be useful for gene-directed drug design.
Availability: The program is implemented as web server and can be reached at http://wwwmgs.bionet.nsc.ru/mgs/programs/2dstructrna/
Contact: titov@bionet.nsc.ru
Introduction
Three-dimensional RNA structure prediction is a long-standing goal of the bioinformatics (for recent experimental and computer methods review see [18]). In the last years several groups reported the use of genetic algorithm (GA) to solve this problem, both for the conformation of a hairpin at the atomic level [24] and for the longer molecules folding [3, 4, 14, 25, 27] in coarse-grained representation of the secondary structure.
GA is a powerful minimization method, taking advantage of the analogies with the natural evolution (see for review [9]). Usually GA consists of cyclic action of three so-called genetic operators: a) solution selection by means of the fitness function, that estimates their relative quality, b) solution recombinations, that implement a large-scale search in the solution space, c) mutations, that prevent premature convergence. The basis for the GA is the assumption that the best solution can be obtained by combining good ones (building blocks hypothesis [17]). Therefore application of GA to the RNA folding search is attractive because of the additivity of its secondary structure energetic rules [28]. Other essential advantages of GA are a possibility to search for the intermediate states [14] and total flexibility in energetic rules - up to accounting for tertiary structure elements, pseudo-knots [11], forbidden in today's most popular dynamic methods [21, 31].
Despite the long history of GA, its adequacy (e.g., optimal parameters) to a specific problem is not predictable. Numerous suggested criteria (see for review [2]) have at their basis a simple heuristic idea [12] that GA must work on smooth landscapes, where considerably different solutions differ essentially in their quality. This is not necessarily the case of the RNA secondary structure: extensive numerical studies of the free energy surfaces for random sequences have shown that they are usually rugged [7]. Therefore it is not surprising that GA application to the RNA secondary structure prediction requires tens of hours of calculation [14] or parallel computations [4]. Correspondingly, till today there missed a GA that can solve such a problem via Internet. In this paper we present the first algorithm of this kind, which allows to find optimal and suboptimal secondary structures as well as to model their changes under the influence of antisense nucleotide. Modeling of interaction with oligonucleotide can be useful for gene-directed drug design.
Our algorithm is organized in the following way: it reads a nucleotide sequence and determines all possible stems. From these stems it creates an initial population of secondary structures. Each possible stem is found in the initial population at least once. Then the population is exposed to the recurrent stochastic action of the genetic operators. At the selection step the least stable structures are eliminated from the population. Resulting vacancies are filled with structures, produced at recombination of a pair of the survived structures. In doing this, the offspring is made maximally different from the parents. Further the population is exposed to multiple mutations, in the course of which the mutating structure loses some of its stems and then is consecutively built up with stems from the general list. During the building up are selected such of them that provide the most stable structure, as in the simplest algorithms modeling the RNA folding [1, 20, 23]. The result of the mutation is rejected if the original structure is more stable than its mutant. Each iteration of the algorithm ends in checking the population diversity, so that the calculation is stopped at its degeneration. The result is the numbers of the double-stranded bases of the most stable secondary structure at the moment of calculation stopping. For calculation of the structure of a complex with antisense oligonucleotide additionally are specified bases of target molecule, which are considered to be screened by the oligonucleotide from the complementary interactions within the molecule. The calculation itself is carried out the same way, with the exception that from the list of possible stems are eliminated the ones that contain specified bases.
Realized in such a way mutations and recombinations reveal a high optimization quality even separately. It allows the algorithm to converge rather fast to the optimal structure or to the one energetically close to it. (Related combination of global search by recombinations with local minimization was already used at hairpin atomic structure calculation [24]). Typical calculation time of our algorithm for few hundred nucleotide sequences is only slightly worse than the fastest at present time dynamic programming methods [21, 31].
Methods
a) secondary structure representation
In the GA each solution (individual) is characterized by the set of its elements (genes). RNA secondary structure is unambiguously determined by stems that compose it, which in virtually all GAs defines its choice for genes [27] (another motivation in favour of such a choice is the folding simulation [14]). In such a way each structure was related with a set of its characteristic stems from list {h} of all possible ones (G-U pairs were allowed).
b) initial population
A good population in GA should be of small size for its fast convergence and, second, should have such an initial state, from which it is possible to achieve a global optimum (in other words, premature convergence can be prevented by covering fairly the solution space). As a trade-off between these two contradictive requirements we designed the initial population in the following way: it contained all the stems from {h} and therewith we tried to avoid their occurence in different structures. Adding (by the rules of their stereochemical compatibility) randomly stems from {h} generated structures. The list {h} was divided into two sublists: {h}1 of the stems residing in structures already emerged, and {h}2 of the rest. At formation of the next structure it was first replenished with stems from {h}2 (which after that moved to {h}1) At the {h}2 exhaustion the structure was built up with stems from {h}1. The procedure ended at the number of structures achieving the population size N=100.
c) selection
Probability of the structure i elimination from the population was calculated according to the equation:
,
where Ei is its free energy. At the energy calculation were used parameters of the nearest neighbours and loops from the compilation [28]. Competition specifying parameters M and D E have the following meaning: M is the expected number of structures eliminated in one generation, D E is the effective energetic resolution, that is i.e. the difference in energy that makes two structures fitness differ e-fold.
d) mutations
For mutations in GA are taken only single gene substitutions, therefore they implement the nearest search. Mutations are considered a minor element of GA, so that in practice it is recommended that they be strongly restricted in intensity to suppress the noise they introduce [12]. Mutations realized in our algorithm were its important element. First a given number of structures were selected randomly from the population. A fixed number of stems, S, were removed from a structure. The steepest descent was carried out from this state, and in doing this the most advantageous (in stability of structure obtained) stems were added consecutively at every step (Fig.1). In this part the algorithm was similar to some RNA folding modeling algorithms [1, 20, 23]. If the resulting structure was at a disadvantage in energy in relation to the initial original one, then the result of mutations was rejected. In such a manner, the blind search for conventional mutations was replaced with the move between separate local optima, so that sometimes the global optimum could be achieved by mutations per se.
e) recombinations
Executing a large-scale search, it is recombinations that distinguish GA from other algorithms of stochastic optimization. On the other hand, usually the choice of the crossover point and the size of fragments being interchanged are uniformly random - and thus there is no precise boundary between the results of mutations and recombinations. Recombinations implemented in our algorithm were directed to the equal and thus maximal difference of the offspring from both parents. Two parental structures, which were selected at random, were matched and their common stems formed an offspring. Its structure was built up by adding random stems by one alternately from parental structures. (Fig.2) When all remaining parental stems became incompatible with already added, the structure was built up with stems from the general list {h}.
f) stop criterion
GA testing has shown that strong population degeneration at not very low N reports that it is irreversibly trapped in the global or deep local optimum, that makes further calculation improfitable. Population degeneration degree was estimated from the equation
,
where Ki is the number of stems in the structure i, kij is the number of stems coinciding in structures i and j. The calculation is stopped after the parameter D exceeds the critical value.
g) modeling interaction with antisense oligonucleotide
The secondary structure of the target molecule in complex with antisense oligonucleotide was simulated on the basis of the simplest physico-chemical model. It was presumed that 1) oligonucleotide binds to specific bases of the target, which further cannot participate in complementary interactions within the target, 2) then the molecule relaxes to its free energy minimum taking account that the bases are screened 3) oligonucleotide binding does not lead to the change of the energetic parameters. Thus, first were specified the numbers of bases of the target, that make part of the intermolecular Watson-Crick pairs. Then a routine GA calculation of the target secondary structure was carried out starting from this reduced list (Fig.3).
h) metric in the structure space
To estimate the effects of structure changes and to compare the calculated structures with experimentally determined ones it is useful to define formally the distance between structures. Several metrics of this kind were suggested previously (see [16] for review). We defined the distance H between structures i and j of the sequence of length L as a normalized sum over partial refolding events:
,
,
where in structure i and stand for double- and single-stranded base k, correspondingly; abs(x) means absolute value of x. The metric H has a practical meaning: exactly the occurrence of bases in this or that state is revealed by the fermentative methods (and is used for further structure deduction by computer minimization). At the same time this metric does not always reflect structure topological proximity.
Algorithm and Implementation
The entire GA protocol is brought off in the following sequence of the procedures described above:
1) Calculate the list {h} of all possible stems for a given RNA sequence.
2) Design an initial population of N structures from {h}.
3) Expose the population to the selection.
4) Run recombinations and fill vacancies resulting from the step 3 with their products.
5) Run multiple mutations.
6) Return to the step 3, if the maximal number of generations or population degeneration is not achieved.
7) Choose from the last generation the structure with the lowest free energy as the result of the calculation.
The algorithm is implemented in C++ language and is installed on P-133. Besides being typed in a window, a nucleotide sequence could be read as a text, GenBank or EMBL files. User also inputs some algorithm parameters - minimal stem size, "mutation radius" S and, in case of modeling interaction with oligonucleotide, numbers of bases to which it binds. At the output he gets numbers of double-stranded bases of the calculated structure.
Results
Comparing with experimental data
The algorithm was tested on the fragment of E.Coli threonyl-tRNA synthetase mRNA and its two mutants having the same secondary structure determined by fermentative analysis [26] (third mutant was not considered because its structure contains a few non-canonical G-A pairs). A set of structures with lowest energies was tested independently by combinatorial method of branches and boundaries. Both methods reported that the experimental structure corresponds to the calculated optimum only when thermodynamic parameters are additionally edited. Namely, we raised by 0,5-1,0 kcal/mol the penalty for multiloops of size more than 5 bases. Interestingly, GA converged distinctly faster with edited parameters. This probably means that a large number of false local optima become shallow and bears witness for kinetic control of the RNA folding. Conversely, noisy distortions of thermodynamic parameters can crater funnel-like RNA landscape and slow down RNA optimization.
For all the three sequences the GA predicted all stems of experimental structure. Calculated and experimentally suggested structures gain the distance H (defined in Methods and measures 0.11, 0.10 and 0.11 for the wild type and two mutants, correspondingly) at the big slightly structured loop of the latter, within which the GA formed stems.
GA parameters optimization
To determine the best conditions for the GA functioning, trial calculations were run. Dependence on the selection and recombination parameters, M and D E, was investigated on the mentioned above wild type mRNA fragment. Therewith the mutations in the GA were switched off. As D E decreases, optimization mode change can be observed. At high D E (quasineutral mode) population drifts randomly until it degenerates at some minimum. D E decreasing strengthens competition among local minima and assists better optimization (Fig. 4). The same consequence have high values of expected death rate, that intensify the search. At D E less than 3 kcal/mol the dependence of the achieved minimum value on these parameters substantially falls off. There is a little point in further D E reducing, because it only extends the number of generations needed for convergence (Fig. 5). This also suggests that a typical error of the molecule structure energy is about 3 kcal/mol.
The number of stems being eliminated during mutations, S, could not be optimized on the mRNA fragment, where even at S=1 (compare with K range from 8 to 11 in the initial population) the GA virtually always finds global optimum. (Note that the estimated size of search space was 1015 structures.) Therefore the dependence on S (Fig.6) was studied on longer random sequence. One random sequence was chosen out of 20 random ones of 300-bp length. This sequence revealed the most stable convergence (more than in 50% of cases) to the structure, that had the lowest free energy of all observed for it. The GA finds this structure even at S=2, that is significantly less than 15-21 - the number of stems, which formed the structures in the initial population. Further increase of S leads to more probable finding of target structure and to convergence to more and more energetically low structures in other cases.
Finally we tested the potential of optimized algorithm to find global minimum. For random sequences it is known that optimal structures have their energies linearly depending on sequence length [8]. The presented algorithm was found to achieve global optimum for sequences up to 150 nt in length (Fig.7). The longer sequences have a chance to be optimized through multiple GA runs.
Modeling structure refolding under the influence of antisense oligonucleotide
For analysis were used the same 20 random sequences of 300 bases length. From results of 6-13 GA runs the best structure for each sequence was picked out. The most stable stem hs in this structure was determined. It was suggested, that oligonucleotide binds to 5' part of hs. Further calculation was carried out following the described in the Methods procedure. Finally the obtained structure was matched to the initial one. Energies of the initial and refolded structures turned out to be linearly related (data not shown). At the refolding events localization analysis it was found that they more commonly are uniformly distributed along the sequence, still sometimes they are found dominantly in the region restricted with the 5' and 3' parts of hs (Fig.8).
Discussion
The spatial RNA structure is a necessary prerequisite for fulfilling its biological functions. Its prediction is one of the oldest problems of bioinformatics and dates back even to the 60s [10]. During these years there was a great progress: a lot of computer algorithms emerged, and energetic rules were updated several times (the most recent table see in [29]). Nevertheless, up to now as an aim for all the algorithms serve structures obtained by phylogenetical analysis. Besides the incomplete knowledge of the thermodynamic parameters, it is possible that lower reliability of the algorithms is determined by different conditions of the RNA existence in vivo and in vitro (ionic strength of the solution, interactions with proteins etc.). Similar problem for proteins is solved by energy parameter derivations from structural information. The advent of RNA parameters of this kind could increase the accuracy of minimization algorithms.
The other important aspect at the secondary structure calculation is that functional structure could appear as an intermediate folding state. There exist a number of experimental arguments in favour of the kinetic control of the RNA folding [11, 19, 30]. This fact has stimulated emergence of folding modeling algorithms, both heuristic [1, 13, 14, 20, 23] and based on kinetic model of the process [6, 22]. Because energies of the loops make the destabilizing contribution to the energy of the structure and at the same time introduces a kinetic barrier for its folding, the kinetic and thermodynamic RNA folding control should commonly agree, leading to the funnel-like folding landscapes. This is supported by the comparison of results of the calculations of tRNA and random sequence energy spectra [15]. The calculations of structures in our work are in agreement with that. So, an optimal structure of mRNA fragment sometimes was found in the population already within few generations after the calculation beginning. Second, the best in GA convergence random sequence had the structure with the lowest energy among 20 studied. But for longer molecules native structure can differ strongly in energy from optimal one [16]; here parameter noise reductions or the folding heuristics [14] can enhance the prediction accuracy.
Third, rare algorithms can account for tertiary structure elements such as pseudo-knots.
The GA can consider all the three points mentioned. It is useful as a traditionally powerful optimization algorithm. The GA can simulate the RNA folding process and find its intermediate states. Finally, it can introduce pseudo-knots in the folding rules [14]. All this makes the GA a considerably promising computer method for the RNA structure analysis.
We proposed a GA-based technique for RNA secondary structure prediction. The structure landscape can be rough and the search space can be extensive. Our algorithm has shown that it is able to overcome these difficulties. Prediction accuracy of the algorithm can be increased using up-to-date energetic rules [29], taking into account partially overlapping stems and pseudo-knots and disregarding destabilizing stems. Our preliminary results show that the latter allows GA to find optimum of sequence of up to 800 nt length. The other work direction is the interface improvement, in particular, a possibility of visualisation in different formats of the obtained results.
In the conclusion we would like to speculate an intriguing possibility of an analogy between GA operators’ action and processes that rule the RNA folding in the cell. Mutations can simulate RNA refolding events under the influence of heat motion or some denaturating agents. For selection by structure energy there finds another reason: the structures with lower energy usually are more compact [8] and therefore are less bound to degrade. Not so clear is the analogy for recombinations, which refer to an interaction of a pair of structures. The recombination product can be represented as a result of two structures refolding at their bimolecular interaction by chance complementary fragments. Another possibility of representation is activation by one RNA molecule of the chaperone-like transmitter that induces refolding in the other molecule. Although there is yet no evidence about RNA chaperone, some RNA-binding proteins can facilitate the correct RNA folding in vitro [5].
Acknowledgement
The authors are grateful to V.P. Valuev for the translation of the manuscript. The work was supported with the grants RFBR-INTAS 95-0653 and RFBR 99-04-49879.
References
Figures
Figure 1. Mutation scheme. At the first step a part of stems is chosen (shown bold in the left figure) Then they are eliminated (central figure) and structure is consecutively built up with the most advantageous stems to the final one (right figure)
Figure 2. Recombination scheme. In bold lines are shown stems inherited by offspring from parental structures. Stems a and b are common for the parents and are directly transferred to the offspring. Stems c and c’ are inherited at equal crossover of the parental structures.
Figure 3. Modeling of refolding under the influence of an antisense oligonucleotide. It is supposed that on the first stage the oligonucleotide (shown with dashed line) contacts with target at the binding site (left figure). Oligonucleotide binding to the target destroys local structure of the latter. On the third stage structure of the sequence out of interaction with the oligonucleotide relaxes to the optimal one (right figure).
Figure 4. Value of the achieved minimum for mRNA fragment (averaged over 4 GA runs) at different energetic resolution parameters D E (semilogarithmic scale) and two values of the expected death rate: M=10 (triangles) and M=90 (squares). The optimal structure had energy of –10.1 kcal/mol.
Figure 5. Population degeneration time (number of GA generations to reach D=0.9, averaged over 4 runs) against the expected death rate M at different energetic resolution D E: 0.1 (rhombs), 1 (squares), 2 (triangles), 8 (crosses).
Figure 6. Dependence of the achieved minimum value (averaged over 4 GA runs) on mutation radius for random sequence of 300 nucleotide length. The best structure found has energy of –49.7 kcal/mol.
Figure 7. Structure energy of random sequences (averaged over 10 sequences) against their length. For each sequence GA was run once.
Figure 8. An example of refolding events localization in random sequence structure under the influence of antisense nucleotide. The refolding events (h(k)=1) cluster in the region between 5’ and 3’ part of the stem (shown with arrows), to the destruction of which was directed the oligonucleotide. The refolding probability, estimated as a number of events per one nucleotide, was 0.49 in the region and 0.19 in the rest of the sequence.