Basic principles of the prediction program

We assume that the nucleotide sequences of highly expressed genes should be adapted to support a high rate of all consecutive expression processes: transcription, RNA processing, mRNA export and translation. Comparative analysis of high- and low-expression mRNAs revealed significant differences in some contextual features (e.g., the 5'UTRs of low-expression mRNAs were longer, had a less optimal context around the translation initiation codon of the main open reading frame, and more frequently contained upstream AUGs; Kochetov et al., 1998). These preliminary data allowed us to develop a knowledge base that accumulates information on discriminative features of 5'UTRs (Kochetov et al., 1998).

Nucleotide sequences of the 5'UTRs of high- (H) and low- (L) expression mRNAs from mammals, dicots and monocots were extracted from the EMBL databank and stored in the subdatabase Leader_SEQ. mRNAs encoding widely distributed proteins (such as actins, tubulins, ribosomal proteins, translation factors, HSP70, CAB, RBCS) were assigned to the H group, and mRNAs encoding regulatory proteins (transcription factors, cyclins, protein kinases, early response genes) were assigned to the L group. The features of the H and L 5'UTRs were then compared. The following characteristics were found to differ between the groups:

1. 5'UTR length;

2. leader nucleotide content (including the G/C and A/U ratios and G+C content, possibly stabilising RNA secondary structure);

3. AUG codons (in different contexts) in the 5'UTR;

4. mono-, di-, and trinucleotide concentrations in the vicinity of the translational start site;

5. position-dependent nucleotide frequencies in the vicinity of the translational start site.
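As an illustration of how some of the features listed above could be computed, here is a minimal Python sketch. It is not the Leader_RNA implementation: the function names (utr_features, position_frequencies), the default window of 10 bases, and the choice to count overlapping AUGs are assumptions made only for this example.

```python
from collections import Counter


def utr_features(utr):
    """Compute a few simple 5'UTR descriptors (illustrative, hypothetical helper)."""
    seq = utr.upper().replace("T", "U")  # accept DNA or RNA alphabet
    counts = Counter(seq)
    length = len(seq)
    # G+C content as a fraction of the leader length
    gc_content = (counts["G"] + counts["C"]) / length if length else 0.0
    # Number of AUG triplets in the leader (overlapping matches are counted)
    upstream_augs = sum(1 for i in range(length - 2) if seq[i:i + 3] == "AUG")
    return {"length": length, "gc_content": gc_content, "upstream_augs": upstream_augs}


def position_frequencies(utrs, window=10):
    """Per-position nucleotide frequencies over the last `window` bases
    preceding the start codon (assumed window size, for illustration)."""
    tails = [u.upper().replace("T", "U")[-window:] for u in utrs if len(u) >= window]
    if not tails:
        return []
    table = []
    for pos in range(window):
        column = Counter(t[pos] for t in tails)
        total = sum(column.values())
        table.append({nt: column[nt] / total for nt in "ACGU"})
    return table
```

For example, utr_features("GCAUGGCAUGAA") reports a length of 12, a G+C content of 0.5 and two upstream AUGs. Comparing such values between the H and L sets is one way the group differences described above could be examined.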

These H- and L-characteristic features are stored in the subdatabase Leader_Why. They are also used as discriminative features in the prediction program. The program evaluates the above-mentioned parameters for the nucleotide sequence of the tested 5'UTR and predicts its translational activity. The prediction is computed as a sum over all parameters used (the system allows the user to select parameters from the list and to change their relative significance). Leader_RNA correctly predicted 84% of high- and 76% of low-expression genes in a control set of sequences.
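The summation with user-selected parameters and adjustable weights can be sketched as follows. This is only a sketch under stated assumptions: the text does not specify how individual parameters are scored or how the sum is converted into a prediction, so the signed per-feature scores (positive for H-like, negative for L-like), the zero threshold, and the name predict_activity are all hypothetical.

```python
def predict_activity(scores, weights=None, threshold=0.0):
    """Combine per-feature scores into one prediction (hypothetical scheme).

    scores:  mapping of feature name -> signed score
             (assumed convention: positive = H-like, negative = L-like)
    weights: mapping of selected feature name -> relative significance;
             features absent from `weights` are left out of the sum.
             If None, every feature is used with weight 1.0.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(scores[name] * w for name, w in weights.items() if name in scores)
    label = "high" if total > threshold else "low"
    return label, total


# Example with made-up scores: use all features, then only G+C content with double weight.
example_scores = {"length": -1.0, "gc_content": 0.5, "upstream_augs": -2.0}
all_features = predict_activity(example_scores)
gc_only = predict_activity(example_scores, weights={"gc_content": 2.0})
```

Selecting a subset of parameters corresponds here to listing only those names in weights, and changing relative significance corresponds to changing the weight values.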

In general, Leader_RNA is a unique computational resource for evaluating mRNA translational efficiency in higher plant or mammalian cells. It may also be useful for designing genetic engineering experiments (e.g., for predicting changes in the translational properties of modified mRNA sequences and for evaluating the translation efficiency of a transgene mRNA in a new host organism).

Kochetov A.V., Ponomarenko M.P., Frolov A.S., Kisselev L.L., Kolchanov N.A. (1999) Prediction of eukaryotic mRNA translational properties. Bioinformatics, V. 15, No. 7/8, pp. 704-712.

Kochetov A.V., Ischenko I.V., Vorobiev D.G., Kel A.E., Babenko V.N., Kisselev L.L., Kolchanov N.A. (1998) Eukaryotic mRNAs encoding abundant and scarce proteins are statistically dissimilar in many structural features. FEBS Lett., V. 440, pp. 351-355.


1997-99, IC&G   SB RAS, Laboratory of Theoretical Genetics