COMPUTER TOOL FOR THE PREDICTION OF EUKARYOTIC mRNA TRANSLATIONAL PROPERTIES

Alex V. Kochetov¹*, Mikhail P. Ponomarenko¹, Anatoly S. Frolov¹, Lev L. Kisselev², Nikolay A. Kolchanov¹

¹Institute of Cytology and Genetics, Pr. Lavrentieva 10, Novosibirsk, 630090 Russia

²Engelhardt Institute of Molecular Biology, Moscow, 117984 Russia

Abstract.

Motivation: It is well known that eukaryotic mRNAs are translated at different levels depending on their sequence characteristics. Evaluation of mRNA translatability is important for predicting gene expression patterns by computer methods and for improving the recognition of mRNAs within cloned nucleotide sequences. It may also be used in biotechnological experiments to optimize the expression of foreign genes in transgenic organisms.

Results: The sets of 5'UTR characteristics, significantly different between mRNAs encoding abundant and scarce polypeptides, were determined for mammals, dicot plants, and monocot plants, and collected in the LEADER_RNA database. Computer tools for the prediction of mRNA translatability are presented.

Availability: Programs for mRNA translatability prediction are available at
http://wwwmgs.bionet.nsc.ru/programs/acts2/mo_mRNA.htm (for monocots), http://wwwmgs.bionet.nsc.ru/programs/acts2/di_mRNA.htm (for dicots),
http://wwwmgs.bionet.nsc.ru/programs/acts2/ma_mRNA.htm (for mammals).
The LEADER_RNA database may be accessed at:
http://wwwmgs.bionet.nsc.ru/systems/LeaderRNA/

Contact: AK@bionet.nsc.ru

Introduction

Prediction of the gene expression pattern through computational analysis of the nucleotide sequence is one of the main tasks of modern bioinformatics. Accurate prediction is very complicated because the level of eukaryotic gene expression may be regulated at various steps: transcription, pre-mRNA processing and export, mRNA translation, and the cytoplasmic stability of the mRNA and polypeptide. Contextual and structural features of the gene nucleotide sequence may influence the efficiency of expression at all stages, hence they should be considered in detail. Analysis of parameters of mRNAs influencing translatability in eukaryotic cells is one of the particular tasks in the framework of this general problem.

It is known that the translational efficiency of eukaryotic mRNAs varies considerably with their sequence characteristics (for a review, see Kozak, 1994; Gallie, 1996; Pain, 1996). Contextual and structural features of the 5’ untranslated region (5’UTR) significantly affect the rate of translation initiation and, thereby, the level of polypeptide production (Ray et al., 1983; Kozak, 1987; Futterer and Hohn, 1992; Gallie and Walbot, 1992). It is widely accepted that the majority of eukaryotic mRNAs are translated through the linear scanning mechanism (for a review, see Kozak, 1994). According to this model, several features of the leader sequence influence mRNA translational efficiency, namely the context of the translational start codon, the occurrence of AUGs within the 5’UTR, and stable secondary structure in the leader. Other 5’UTR features have also been shown to influence mRNA translatability. The 5’UTRs of some cellular and viral mRNAs ("translational enhancers") increase the translation efficiency of the downstream coding sequences (Jobling and Gehrke, 1987; Gallie et al., 1987; for a review, see Gallie, 1996). However, the nucleotide sequences of these enhancers have no common elements.

An analysis of the nucleotide sequences of mRNAs that were shown experimentally to be translated at different levels may be used to reveal the sequence features important for high mRNA translatability. However, the available experimental data are not sufficient for such a computer analysis. To overcome this problem, we have made a comparative analysis of the characteristics of mRNAs encoding abundant and scarce eukaryotic polypeptides (Kochetov et al., 1998a,b). A high rate of polypeptide synthesis is likely to be achieved only if all expression processes (transcription, splicing, translation, etc.) occur at a high rate, because low efficiency at any stage limits the total polypeptide production. Thus, investigation of the nucleotide sequences of mRNAs of highly expressed genes can help to reveal the sequence features essential for efficient expression. There are several examples of the usefulness of this approach, e.g. research on the relative "strength" of translational termination codons (Brown et al., 1990) and on the frequencies of synonymous codon usage (Ikemura, 1985).

We have compared the mRNA features of several groups of housekeeping genes that are highly expressed in eukaryotic cells (H-mRNAs) and of regulatory genes with low expression under stringent control (L-mRNAs). It was found that the 5’UTRs of H-mRNAs differ considerably from those of L-mRNAs and presumably could support more efficient translation (Ischenko et al., 1996; Kochetov et al., 1998a,b). This is an argument in favor of the assumption that a high expression level of eukaryotic genes relies on highly efficient mRNA translation.

The significant difference between H- and L-mRNAs was used to design a prediction method that evaluates the translatability of mRNAs of newly sequenced genes. A computer tool for predicting the mRNA translational properties of genes in mammals and higher plants is presented. This technique also permits evaluation of the translatability of mRNAs of foreign genes (transgenes) in mammalian and plant cells by comparing their sequence characteristics with those of high- and low-expression host mRNAs.

Systems and methods

The LEADER_RNA database has been implemented in ANSI-standard C. It has been successfully compiled on the Intel PC platform using the Borland C compiler, version 4.5, under Windows 95. The basic scheme of the LEADER_RNA database is shown in Figure 1. The LEADER_RNA database consists of four domains. First, the sequence database LEAD_SEQ compiles the sequences of mRNA 5’UTRs of high- and low-expression eukaryotic genes. Second, the knowledge base LEAD_KNO contains the description of the 5’UTR features that differ between H- and L-mRNAs. The programs that use these features to predict mRNA translatability are invoked from this knowledge base, and they are documented within it by their control test results on independent experimental data. Third, the database LEAD_WHY contains the description of published experimental data concerning the influence of mRNA sequence features on translatability. Fourth, abstracts of the related papers are collected in the LEAD_REF database. The LEADER_RNA database is SRS-formatted and is therefore accessible to SRS users through the WWW interface (Etzold and Argos, 1993).

Non-redundant mRNA sequences of high- and low-expression eukaryotic genes were extracted from EMBL nucleotide sequence databank, Release 49, and novel non-redundant sequences were extracted from EMBL, Release 52. A training set was compiled from 5’UTRs of both these sets whereas the control set contained only novel sequences from the later EMBL release. We believe that this control set reflects the accumulation of nucleotide sequences in the databank and genes of interest for molecular biologists.

Selection of H- and L-mRNAs

mRNAs of mammalian and higher plant genes representing different eukaryotic taxa were analyzed. Monocot and dicot plant mRNAs were analyzed separately, because they differ considerably in many contextual features (Kochetov et al., 1998c). mRNAs of abundant eukaryotic proteins of the following families were tested (Table 1): A) for all three taxa: translation elongation factor 1 alpha and ribosomal proteins, actins, 70 kDa heat shock proteins, and histones; B) mammalian tubulins and myosins; C) plant anaerobiosis-induced (alcohol dehydrogenase, aldolase) and photosynthesis-related (RbcS, Cab) polypeptides. All these polypeptides are vitally essential and are synthesized in eukaryotic cells in considerable amounts. Sequences of mRNAs encoding abundant polypeptides were selected on the assumption that most of them should be efficiently translated. However, some eukaryotic polypeptides are encoded by a gene family, and the contribution of various family members to the synthesis of an abundant protein may vary. Therefore, the H-mRNA set may include some sequences that contribute little to the protein yield. Since the prediction technique presented in this paper is based on the statistical difference between high- and low-expression mRNAs, we believe that a slight contamination of the high-expression mRNA set with poorly translated mRNAs cannot decrease the prediction quality considerably.

L-mRNAs, which encode proteins present in cells in small amounts, include mRNAs for growth factors, receptors, transcription factors, protein kinases, proteins encoded by oncogenes and tumor-suppressor genes, and other regulatory proteins. The expression of these genes is under stringent control not only at the transcriptional level, but also through decreased stability of the mRNA (Chen and Shyu, 1995) and of the proteins (Pahl and Baeuerle, 1996). To select mRNAs encoding transcription factors, we have used the TRANSFAC database (Wingender et al., 1996).

Both full-sized and possibly incomplete 5’UTRs were collected in the training sets. To select full-sized 5’UTRs from the EMBL DNA entries, the Feature Table key "CDS" and one of the keys "mRNA", "precursor_RNA", "prim_transcript", and "5’UTR" were used (all fields without "<"); to select possibly incomplete 5’UTRs from the EMBL RNA entries, the start point of the sequence and the key "CDS" were used. Sequence characteristics of 5’UTRs with an experimentally mapped 5’ end and of 5’UTRs from cDNAs were compared. It was found that the full-sized 5’UTRs of H- and L-mRNAs differed more than the possibly incomplete 5’UTRs did. The difference between the samples of complete and incomplete leaders of either H-mRNAs or L-mRNAs is much smaller than that between the samples of 5’UTRs of H- and L-mRNAs, including both full-sized and incomplete sequences. In addition, the samples of full-sized mRNA 5’UTRs contain significantly fewer sequences than those of possibly incomplete 5’UTRs (Table 1). Thus, we have combined complete and incomplete sequences and analysed them together.

High expression mRNAs were represented in the control set with 19 5’UTRs of mammals, 27 5’UTRs of dicot plants, and 19 5’UTRs of monocot plants; low expression ones were represented with 60, 40, and 11 5’UTRs, respectively.

Data analysis

To reveal the sequence characteristics, significantly different for the 5’UTRs of H- and L-mRNAs, we have applied an approach from the computer system Sitevideo (Kel et al., 1993; Kolchanov et al., 1998). Under this approach, sets of nucleotide sequences of different functions (or different functional activities) are collected in the database SAMPLES and compared for various contextual features. The features differing significantly are described in a special knowledge base supplemented with the C-code programs for the discrimination between these types of sequences (Kolchanov et al., 1998).

Using this approach, the LEADER_RNA database (Figure 1), which consists of four related databases, has been designed (see examples of entries in Figure 2). Sets of 5’UTR sequences of mammalian and higher plant mRNAs were selected from the EMBL databank (see above) and collected in the database LEAD_SEQ (Figure 2B). Each sequence is marked by a unique identifier in the field SC (taken from the field ID of the corresponding EMBL entry). Each entry also contains the 5’UTR length, its position relative to the translational start site (PA), and the level of gene expression (SA). Links to the related databases LEAD_REF (RN) and LEAD_KNO (KN), described below, are given for the 5’UTRs of each taxonomic group.

5’UTR nucleotide sequences of H- and L-mRNAs of these taxa were analyzed for a number of features (listed in Table 2) that may be subdivided into two groups: (i) features shown experimentally to influence the mRNA translation rate (e.g. the presence of upstream AUGs, the context of the translational start, etc.); (ii) other contextual features. A large number of 5’UTR contextual features were analyzed, including positional weighted concentrations of mono-, di-, tri-, and tetranucleotides (Ponomarenko et al., 1998). For the sequence S = s₁...s_i...s_L of length L, the weighted concentration of the oligonucleotide Z = z₁...z_j...z_m of length m is estimated by the equation:

, (1)
where 1 ≤ m ≤ L; δ_Z(s_i s_i+1...s_i+m-1) is the function denoting the presence (1) or absence (0) of the oligonucleotide Z of length m at the i-th position of the sequence S; s_i ∈ {A, T, G, C}.

The first position of the oligonucleotide Z = z₁...z_j...z_m corresponds to the i-th position of the sequence S, where z_j ∈ {A, T, G, C, W=A/T, R=A/G, M=A/C, K=T/G, Y=T/C, S=G/C, B=T/G/C, V=A/G/C, H=A/T/C, D=A/T/G, N=A/T/G/C}; w(i) is the function of the significance of positions (0 ≤ w(i) ≤ 1), which takes into account the fact that different oligonucleotides have the greatest impact when they are located at different positions in the site. The Kolmogorov-Smirnov test and Pearson’s linear and Kendall’s rank tau (Kendall and Gibbons, 1990) correlation coefficients were used to measure the relation between these variables and the gene expression level. These statistics are based on different assumptions (parametric and nonparametric); thus, the relations between 5’UTR features and the expression levels of the corresponding genes were analyzed independently.
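The weighted concentration described above can be illustrated with a short Python sketch. It is not the authors' C code. The function names (`matches`, `weighted_concentration`) are hypothetical, and the normalisation by the sum of the weights is an assumption, since the exact form of equation (1) is not shown in the text.

```python
# IUPAC-style codes exactly as listed in the text (alphabet A, T, G, C).
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "R": "AG", "M": "AC", "K": "TG", "Y": "TC", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def matches(oligo, window):
    """delta_Z: 1 if every base of window fits the code at the same place in oligo, else 0."""
    return int(all(base in IUPAC[code] for code, base in zip(oligo, window)))


def weighted_concentration(seq, oligo, weight=lambda i: 1.0):
    """Weighted concentration of oligo (Z) in seq (S).

    ASSUMPTION: the weighted number of hits is divided by the sum of the
    weights; the source does not show the exact normalisation.
    weight(i) receives the 1-based start position i and returns w(i) in [0, 1].
    """
    L, m = len(seq), len(oligo)
    if not 1 <= m <= L:
        raise ValueError("oligonucleotide length must satisfy 1 <= m <= L")
    positions = range(1, L - m + 2)          # i = 1 .. L-m+1
    total_weight = sum(weight(i) for i in positions)
    if total_weight == 0:
        return 0.0
    hits = sum(weight(i) * matches(oligo, seq[i - 1:i - 1 + m]) for i in positions)
    return hits / total_weight
```

With the default uniform weight this reduces to the plain frequency of the oligonucleotide; a position-dependent `weight` lets occurrences near a chosen region (for example, close to the start codon) count more.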

Features statistically different between the H- and L-mRNA 5’UTR sets of dicot, monocot, and mammalian genes were collected in the knowledge base LEAD_KNO (Figure 2A). Each feature in this knowledge base is characterized by its mean values for the high- and low-expression 5’UTR training sets (field AB) and for the high- and low-expression control sets (fields ST and NT, respectively; mean values, standard errors, and the percentages of incorrect predictions are given there). The type of the feature and its dependence on the mRNA expression level are given in the fields PV and CT, respectively. C-code programs for discriminating between H- and L-mRNAs on the basis of the revealed statistical differences were generated automatically (Ponomarenko et al., 1997; field C-CODE). The data for mammals, dicots, and monocots are stored separately because the 5’UTR features characteristic of H- and L-mRNAs of these taxa differ among the subsets tested (listed in Table 2). The field DP contains the link to the database of mRNA features that were shown experimentally to influence translatability (LEAD_WHY, see example entry in Figure 2C). Each type of feature (classified in the fields MI, MN, MD, ML) is characterized according to its utility, e.g. the statistical difference between H- and L-mRNA 5’UTRs (fields PN, PM, PV) or known experimental data (free text in the field REASONS). Links to the abstracts of related papers collected in the database LEAD_REF (Figure 2D) are placed in the field RN.

Algorithm

The knowledge base contains the discriminating features revealed in the analysis (46 for mammals, 27 for dicots, 20 for monocots) together with the related programs for predicting the translational efficiency of a given mRNA. For each feature i, the mean values for the H- and L-mRNA training sets (Xi_H and Xi_L, respectively) correspond to +1 and –1, respectively. To predict the translatability of a given mRNA using these discriminating features, the values of its 5’UTR characteristics (Xi) are determined and compared with those for H- and L-mRNAs according to equation 2:

$$F_i = \frac{2X_i - X_{i,H} - X_{i,L}}{X_{i,H} - X_{i,L}}, \quad (2)$$

If the value of Xi lies beyond the mean for H-mRNAs or for L-mRNAs, Fi is set to +1 or –1, respectively.

Since the database compiles expert rules introduced by investigators on the basis of various experimental approaches, we could not evaluate the relative contribution of these rules to mRNA translatability. In principle, it is important to make all opinions concerning mRNA translational efficiency available for use in translatability prediction. This approach is based on Decision Making Theory (Fishburn, 1970). Thus, we have collected all revealed differences in the knowledge base and generated prediction programs for all N revealed 5’UTR characteristics. The computer program for predicting the translatability of a given mRNA tests all these 5’UTR features and displays the results. The score F(seq) is calculated as follows:

$$F(seq) = \frac{1}{N} \sum_{i=1}^{N} F_i, \quad (3)$$

The value of F(seq) varies from –1 (for a typical L-mRNA) to +1 (for a typical H-mRNA). If F(seq) > 0, the sequence is assumed to be translated at a high level; if F(seq) < 0, at a low level.

In the general case, all discriminative features (Fi) are considered independently and contribute equally. However, a user may change the weights of the features as follows:

$$F(seq) = \frac{\sum_{i=1}^{N} W_i F_i}{\sum_{i=1}^{N} W_i}, \quad (4)$$

The weight coefficients (Wi), ranging from 0 to 10, can be defined by the user. If Wi = 0, the corresponding 5’UTR feature is excluded from the prediction. If all Wi are equal, F(seq) is calculated according to equation 3.
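The scoring scheme of this section can be sketched in Python as follows. This is an illustration only, not the authors' generated C code. The function names `normalise` and `score` are hypothetical. The sketch assumes that each feature value is mapped linearly so that the L-mRNA mean gives –1 and the H-mRNA mean gives +1, with values beyond either mean clipped, and that F(seq) is the weighted mean of the normalised features.

```python
def normalise(x, mean_h, mean_l):
    """Map a feature value to [-1, +1]: mean_l -> -1, mean_h -> +1.

    ASSUMPTION: linear interpolation between the two training-set means;
    values beyond either mean are clipped to -1 or +1.
    """
    if mean_h == mean_l:
        raise ValueError("H and L means must differ for a discriminating feature")
    f = (2.0 * x - mean_h - mean_l) / (mean_h - mean_l)
    return max(-1.0, min(1.0, f))


def score(values, means_h, means_l, weights=None):
    """F(seq) as the weighted mean of normalised features.

    weights are the user-defined Wi (0..10); a weight of 0 removes a feature.
    With equal weights the result is the plain mean of the Fi.
    """
    if weights is None:
        weights = [5] * len(values)          # equal weights, as in the simplest case
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one feature needs a non-zero weight")
    fs = [normalise(x, h, l) for x, h, l in zip(values, means_h, means_l)]
    return sum(w * f for w, f in zip(weights, fs)) / total
```

A sequence is then classified by the sign of the score: positive values suggest H-like translatability and negative values L-like translatability. Note that the mapping also works for features whose H-mRNA mean is lower than the L-mRNA mean, because the denominator changes sign.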

Implementation and discussion

In the framework of the Sitevideo approach (Kel et al., 1993; Kolchanov et al., 1998), a large number of 5’UTR contextual features were analysed. Those found to be statistically different between the high- and low-expression samples of mammalian, dicot, and monocot mRNAs were collected in the knowledge base. An mRNA of interest can be analysed with the WWW-accessible program using these discriminative 5’UTR features. The 5’UTR nucleotide sequence should be typed in or loaded from a file. The 5’UTR features analyzed are listed together with their weight coefficients Wi (ranging from 0 to 10; see equation 4). In the simplest case, the mRNA translational efficiency F(seq) may be analyzed by applying all discriminative features together with the weight coefficients Wi = 5. If all weight coefficients Wi are equal, equation (4) reduces to equation (3). It was found that H- and L-mRNAs in the control sets are predicted with high accuracy. Namely, 77.8% of dicot, 78.3% of mammalian and 84.2% of monocot H-mRNAs were predicted to be translated efficiently (F(seq) > 0); 72.5% of dicot, 84.4% of mammalian and 81.8% of monocot L-mRNAs were predicted to be translated at low efficiency (F(seq) < 0).

To make the prediction more accurate and to use this computer system for designing biotechnological experiments, the 5’UTR discrimination features should be analyzed in more detail. The sets of discrimination features may be subdivided into several groups. Analysis of the translation efficiency of human cytochrome P450 IID6 (CYP2D6) mRNA (GenBank ID HUMCYP2D6) in dicot plant cells is shown as an example (Figure 3). Twenty-seven 5’UTR features were used to evaluate the mRNA translatability in dicot plant cells. A user may define the subset of criteria and their weight coefficients Wi (in this case, all criteria are included, and Wi = 5 for each of them).

1. The presence of upstream AUGs ([AUG] content). It is well known that the presence of AUG codons within the 5’UTR decreases the mRNA translation rate. The negative influence of these upstream AUGs depends on their context (a "stronger" context enhances the negative influence) and on the position of the encoded ORFs. Commonly, ORFs starting from upstream AUGs are very short. It was found that the negative influence of an upstream mini-ORF is much greater if this ORF overlaps with the protein coding sequence (for a review, see Futterer and Hohn, 1996). The parameters [AUG] optimised and [AUG] "-3"-ruled take into account the contexts of the upstream AUGs. The parameter [AUG] framed accounts for the overlap of mini-ORF(s) with the coding sequence. Since the 5’UTR tested does not contain upstream AUGs, both [AUG] content and the three related parameters are equal to +1.
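As an illustration of the quantities discussed in this item, the following Python sketch lists the AUGs in a leader, reports the nucleotide at position -3 relative to each of them, and checks whether the mini-ORF opened by each upstream AUG runs into the coding sequence. The function name `upstream_augs` and the returned field names are hypothetical; the sketch does not reproduce the scoring of the parameters [AUG] optimised, [AUG] "-3"-ruled, or [AUG] framed, whose exact definitions are not given here.

```python
STOPS = {"UAA", "UAG", "UGA"}


def upstream_augs(leader, cds=""):
    """Describe each AUG found in the 5'UTR.

    leader: 5'UTR sequence (DNA or RNA letters).
    cds: coding sequence starting at the main AUG (optional).
    Returns a list of dicts with the 0-based position of the upstream AUG,
    the nucleotide at -3 (None if the AUG is too close to the 5' end), and
    whether the ORF opened by this AUG extends into the coding sequence.
    """
    leader = leader.upper().replace("T", "U")
    full = leader + cds.upper().replace("T", "U")
    found = []
    start = leader.find("AUG")
    while start != -1:
        overlaps = True                      # no stop found: ORF reads past the leader
        for j in range(start, len(full) - 2, 3):
            if full[j:j + 3] in STOPS:
                # The mini-ORF overlaps the CDS only if its stop ends beyond the leader.
                overlaps = j + 3 > len(leader)
                break
        found.append({
            "position": start,
            "minus3": leader[start - 3] if start >= 3 else None,
            "overlaps_cds": overlaps,
        })
        start = leader.find("AUG", start + 1)
    return found
```

An empty result corresponds to a leader without upstream AUGs, as in the CYP2D6 example discussed in the text.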

2. Context of the translational start codon. To evaluate the relative "strength" of the translational start site, the parameters High-consensus matches, Low-consensus matches, High-ShortFreqMatr, and "-3 position" rule were used. The first two concern the similarity of the start codon context (positions -6 to -1) to the consensus (High-consensus matches) and the anti-consensus (Low-consensus matches) of dicot H-mRNAs. High-ShortFreqMatr is a more advanced parameter that uses the frequency matrix of nucleotides at those positions of the H-mRNA start codon context that were found to be statistically different between H- and L-mRNAs. The "-3 position" rule accounts for the nucleotide at position -3 upstream of the translational start codon.

3. 5’UTR significant features which are likely to influence the mRNA secondary structure. It is known that 5’UTR secondary structure may decrease mRNA translatability (Kozak, 1994; Pain, 1996). The G+C content and the ratios of the complementary nucleotides (A/U, G/C, G/U) influence the stability of the potential secondary structure. If the frequencies of complementary nucleotides in the 5’UTR are close to each other, the potential secondary structure could be more stable. These parameters were tested for H- and L-mRNAs, and those differing significantly were included in the knowledge base. The parameters [A]:[T] ratio and [A]:[T] disbalance were used in the case of dicot mRNAs.
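The composition parameters named in this item can be computed along the following lines. This Python sketch is only illustrative: the function name `composition_features` is hypothetical, and the formulas for the ratio ([A]/[T]) and the disbalance (|[A]-[T]| / ([A]+[T])) are assumptions, because the text names these parameters without defining them.

```python
def composition_features(leader):
    """Return the G+C content and the A:T(U) ratio and imbalance of a 5'UTR.

    ASSUMPTION: "ratio" is taken as [A]/[T] and "disbalance" as
    |[A]-[T]| / ([A]+[T]); the source does not define either formula.
    """
    seq = leader.upper().replace("U", "T")   # accept RNA or DNA letters
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    a, t, g, c = (seq.count(base) for base in "ATGC")
    return {
        "gc_content": (g + c) / n,
        "at_ratio": a / t if t else float("inf"),
        "at_imbalance": abs(a - t) / (a + t) if a + t else 0.0,
    }
```

Under these assumed definitions, an imbalance close to 0 means that the complementary nucleotides occur at similar frequencies, the situation in which, according to the text, the potential secondary structure could be more stable.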

4. Other 5’UTR significant features.

Most 5’UTR features (17 in total) fall within this subset. They were included in the knowledge base because they were statistically different between the H- and L-mRNAs of dicot genes. The utility of all 27 features was evaluated by analysis of the control sets of H- and L-mRNAs. The percentages of incorrect predictions for H-mRNAs and for L-mRNAs are recorded in the knowledge base and may be accessed directly within the WWW-based prediction program. The user can thus judge the utility of a particular criterion and decide whether or not to use it in the prediction.

5. Limitations and exceptions

It should be taken into account that the prediction technique presented here is based on the scanning model (Kozak, 1994) and may be used for the analysis of 5’UTRs of mRNAs translated by this mechanism. In the framework of this model, we analyse the basal level of mRNA translational activity. In some cases, mRNA translational efficiency may be regulated by polypeptide trans-factors recognizing cis elements within the 5’UTR, e.g. the iron responsive element or the oligopyrimidine tract at the 5’ end of ribosomal protein mRNAs (for a review, see Pain, 1996). Prediction of such specific sites demands the use of special computer tools and cannot be performed by our program.

Currently, there are many examples of mRNAs known to be translated through an internal ribosome entry site (IRES) or a ribosomal shunt (for a review, see Futterer and Hohn, 1996; Pain, 1996). These mRNAs often contain AUGs and stable hairpins within the leader sequences, and their translatability cannot be correctly predicted by the tool presented in this paper. The problem of recognizing an IRES within an mRNA 5’UTR is very complex, since it is not clearly known what sequence elements may form it.

Conclusion

In this work, we have tried to determine the 5’UTR features possibly influencing mRNA translational efficiency in mammals, dicots, and monocots. Sequence analysis of various 5’UTR features of H- and L-mRNAs was performed to define which of them are discriminating. The knowledge base contains the discriminating features revealed in this analysis (46 for mammals, 27 for dicots, 20 for monocots) together with the related programs for predicting the translational efficiency of a given mRNA. The list of these discriminative features will be extended as new experimental and statistical data accumulate. In particular, we plan to analyse the 5’UTR sequences of H- and L-mRNAs for secondary structure parameters and to use these data in the prediction process.

It should be noted that mRNAs of both high- and low-expression genes are necessarily translated, though with different efficiency. High-expression mRNAs must provide a high polypeptide synthesis rate during development, under stress conditions, etc. Thus, most of these mRNAs should be translated efficiently, although some members of multigene families that make a minor contribution to polypeptide production may produce mRNAs translated at a low level. Many of the low-expression mRNAs (encoding transcription factors, oncogenes, protein kinases, etc.) contain sequence elements which were shown in experiments to decrease their translation efficiency (Kozak, 1986; 1994; Rao et al., 1988). It may be important to limit the level of expression of regulatory polypeptides, because their overproduction may be harmful. However, low translational efficiency of mRNAs is a very frequent, but not obligatory, feature of the genes encoding regulatory polypeptides. We assume that a statistically significant difference between the contextual features of H- and L-mRNA samples may reveal the features influencing mRNA translatability, as was shown for the occurrence of upstream AUGs and the context of the translational start codon (Ischenko et al., 1996; Kochetov et al., 1998a; 1998b). It is important to take into account the 5’UTR features that are essential but still not determined (i.e. translational enhancers (Gallie, 1996)).

In this paper, we presented the results of the comparative analysis of 5’UTRs of H- and L-mRNAs accumulated in the WWW-accessible database LEADER_RNA, and a computer tool for the prediction of mRNA translational properties in cells of mammals and higher plants. Prediction of mRNA translatability is based on the similarity of its 5’UTR contextual features to those typical of mRNAs of high- and low-expression genes of the corresponding taxon. The LEADER_RNA database and the related WWW-accessible prediction program are under development. In future, we plan to extend the list of eukaryotic taxa and the number of 5’UTR criteria. It was shown that H- and L-mRNAs also differ in the relative "strength" of the translational termination signal (Brown et al., 1990; Kochetov et al., 1998b) and in the usage of synonymous codons (Likhoshvai and Matushkin, 1998); we plan to take these important mRNA contextual features into consideration. In the context of the general purpose of this work (i.e. the development of computer techniques to predict the pattern of gene expression), we plan to extend the computer analysis of high- and low-expression eukaryotic genes to other functional domains (mRNA 3’UTR, introns, splicing and polyadenylation signals, etc.) and to expression processes other than mRNA translation.

ACKNOWLEDGEMENTS

This work was supported by the Russian Fund for Basic Research and the Russian Human Genome Program. A.K. was supported by an SD RAS grant for young scientists and an RFBR grant for the support of the scientific school of academician V.K. Shumny. N.K. was also supported by an SD RAS interdisciplinary grant. L.K. benefited from the Chaire Internationale Blaise Pascal (Ecole Normale Superieure, Paris, Ile-de-France), from the Human Frontier Science Program and from the Program of Support for Scientific Schools (Russia). The authors are grateful to G.V. Orlova for assistance in translating the manuscript into English.

References

Brown,C.M., Stockwell,P.A., Trotman,C.N.A. and Tate,W.P. (1990) Sequence analysis suggests that tetra-nucleotides signal the termination of protein synthesis in eukaryotes. Nucleic Acids Res. 18, 6339-6345.

Chen,C-Y.A. and Shyu,A-B. (1995) AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem. Sci. 20, 465-470.

Etzold,T. and Argos,P. (1993) SRS - an indexing and retrieval tool for flat file data libraries. Comput. Applic. Biosci., 9, 49-57.

Fishburn,P.C. (1970) Utility theory for decision making. New York, John Wiley & Sons.

Futterer,J. and Hohn,T. (1996) Translation in plants - rules and exceptions. Plant Mol. Biol. 32, 159-189.

Futterer,J. and Hohn,T. (1992) Role of an upstream open reading frame in the translation of polycistronic mRNAs in plant cells. Nucleic Acids Res. 20, 3851-3857.

Gallie,D.R. (1996) Translational control of cellular and viral mRNAs. Plant Mol. Biol. 32, 145-158.

Gallie,D.R. and Walbot,V. (1992) Identification of the motifs within the tobacco mosaic virus 5’-leader responsible for enhancing translation. Nucleic Acids Res. 20, 4631-4638.

Gallie,D.R., Sleat,D.E., Watts,J.W., Turner,P.C. and Wilson,T.M. (1987) The 5'-leader sequence of tobacco mosaic virus RNA enhances the expression of foreign gene transcripts in vitro and in vivo. Nucleic Acids Res. 15, 3257-3273.

Ikemura,T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13-34.

Ischenko,I.V., Kochetov,A.V., Kel,A.E., Kisselev,L.L. and Kolchanov,N.A. (1996) Comparative analysis of the local secondary structure of mRNAs encoded by high and low-expression eukaryotic genes. In: Computer Science and Biology. Proceedings of the German Conference on Bioinformatics, Leipzig, pp. 124-129.

Jobling,S.A. and Gehrke,L. (1987) Enhanced translation of chimaeric messenger RNAs containing a plant viral untranslated leader sequence. Nature 325, 622-625.

Kel,A.E., Ponomarenko,M.P., Likhachev,E.A., Orlov,Y.L., Ischenko,I.V., Milanesi,L. and Kolchanov,N.A. (1993) SITEVIDEO: a computer system for functional site analysis and recognition. Investigation of the human splice sites. Comput. Applic. Biosci., 9, 617-627.

Kendall, M. and Gibbons, J.D. (1990) Rank correlation methods (5th edition). Edward Arnold, London.

Kochetov,A.V., Ponomarenko,M.P., Vorobiev,D.G., Frolov,A.S., Kisselev,L.L. and Kolchanov,N.A. (1998a) Eukaryotic mRNAs encoding abundant and scarce proteins are dissimilar in many structural features of 5’-untranslated leaders. In Kolchanov,N.A. et al. (eds), Proc. 1st Intern. Conf. of Bioinformatics of genome regulation and structure, Novosibirsk, 1, pp. 214-217.

Kochetov,A.V., Ischenko,I.V., Vorobiev,D.G., Kel,A.E., Babenko,V.N., Kisselev,L.L. and Kolchanov,N.A. (1998b) Eukaryotic mRNAs encoding abundant and scarce proteins are statistically dissimilar in many structural features. FEBS Lett., 440, 351-355.

Kochetov,A.V., Pilugin,M.V., Kolpakov,F.A., Babenko,V.N., Kvashnina,E.V. and Shumny,V.K. (1998c) Structural and compositional features of 5’untranslated regions of higher plant mRNAs. In Kolchanov,N.A. et al. (eds), Proc. 1st Intern. Conf. of Bioinformatics of genome regulation and structure, Novosibirsk, 1, pp. 210-213.

Kolchanov,N.A., Ponomarenko,M.P., Kel,A.E., Kondrakhin,Yu.V., Frolov,A.S., Kolpakov,F.A., Kel,O.V., Ananko,E.A., Ignatieva,E.V., Podkolodnaya,O.A., Stepanenko,I.L., Merkulova,T.I., Babenko,V.N., Vorobiev,D.G, Lavryushev,S.V., Ponomarenko,Yu.V., Kochetov,A.V., Kolesov,G.B., Podkolodny,N.L., Milanesi,L., Wingender,E., Heinemeyer,T. and Solovyev,V.V. (1998) GeneExpress: a computer system for description, analysis, and recognition of regulatory sequences of the eukaryotic genome. In Glasgow,J. et al. (eds), Proceedings of The Sixth International Conference on Intelligent Systems for Molecular Biology, ISMB-98, Montreal, Canada, AAAI Press, pp.95-104.

Kozak,M. (1994) Determinants of translational fidelity and efficiency in vertebrate mRNAs. Biochimie 76, 815-821.

Kozak,M. (1987) At least 6 nucleotides preceding the AUG initiator codon enhance translation in mammalian cells. J. Mol. Biol. 196, 947-950.

Kozak,M. (1986) Bifunctional messenger RNAs in eukaryotes. Cell 47, 481-483.

Likhoshvai,V.A. and Matushkin,Yu.G. (1998) Theoretical analysis of possible evolutionary trends in codon distribution along the mRNA. In Kolchanov,N.A. et al. (eds), Proc. 1st Intern. Conf. of Bioinformatics of genome regulation and structure, Novosibirsk, 2, pp. 341-344.

Pahl,H.L. and Baeuerle,P.A. (1996) Control of gene expression by proteolysis. Curr. Opin. Cell Biol. 8, 340-347.

Pain,V. (1996) Initiation of protein synthesis in eukaryotic cells. Eur. J. Biochem. 236, 747-771.

Ponomarenko,M.P., Kolchanova,A.N. and Kolchanov,N.A. (1997) Generating programs for predicting the activity of functional sites. J. Comput. Biol. 4, 83-90.

Ponomarenko,M.P., Frolov,A.S., Ponomarenko,J.V., Podkolodnaya,O.A., Vorobyev,D.V., Kolchanov,N.A. and Overton,G.C. (1998) Mean-recognition: a systematic approach increasing the accuracy of the functional site recognition for the genomic DNA annotation. In Callaos,N. and Holmes,L. (eds), Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, SCI’98, Volume 4, Orlando, Florida, pp. 224-230.

Rao,C.D., Pech,M., Robbins,K.C. and Aaronson,S.A. (1988) The 5' untranslated sequence of the c-sis/platelet-derived growth factor 2 transcript is a potent translational inhibitor. Mol. Cell. Biol. 8, 284-292.

Ray,B.K., Brendler,T.G., Adya,S., Daniels-McQueen,S., Miller,J.K., Hershey,J.W.B., Grifo,J.A., Merrick,W.C. and Thach,R.E. (1983) Role of mRNA competition in regulating translation: further characterization of mRNA discriminatory initiation factors. Proc. Natl. Acad. Sci. USA 80, 663-667.

Wingender,E., Dietze,P., Karas,H. and Knueppel,R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238-241.

TABLES

Table 1. Samples of the 5’UTR sequences of H- and L-mRNAs.

Taxon	H-mRNAs: total (5’ mapped)	L-mRNAs: total (5’ mapped)
Mammals	77 (26)	151 (17)
Dicot plants	187 (45)	266 (22)
Monocot plants	92 (15)	68 (11)

Table 2. 5’UTR features that are statistically significantly different between the mRNAs of high- and low-expression genes^a.

Features of mRNA 5’UTRs^b

Content of the complementary nucleotides

[A]:[T] ratio, [A]:[T] disbalance, [G]:[C+T] disbalance (Mm), [G+C] content (Mm), [G]:[C+T] ratio (Mm)

Leader length

Context of translational start site

Depends on the "-3" rule, High-consensus matches, High-ShortFreqMatrix, Low-consensus matches

Content of upstream AUGs in different contexts

[AUG] content, [AUG] in frame with the coding sequence, [AUG]:[-AUG] ratio (Mm,Mc), [AUG]:[-AUG] disbalance (Mm,Mc), [AUG] optimized (at -3 position), [AUG] optimized (at -3 and +4 positions).

Nucleotide content in 5’UTRs

[T] content, [G] content (Mm), [K] content (Mc), [KB] content (Mc), [SCV] content (Mm), [WxHK] (Mm), [TS] content (Mm,Dc)

High/Low frequency ratios for oligonucleotide content:

K,M, {K/M,K/M}, {K/M,K/M,K/M} (Mc,Dc), {K/M,K/M,K/M,K/M} (Mc,Dc), {K/M,K/M,K/M,K/M,K/M} (Dc), {K/M,K/M,K/M,K/M,K/M,K/M} (Dc), {K/M,N,K/M} (Mm,Dc), {K/M,N,K/M,N,K/M} (Mm,Dc), {K/M,N,K/M,N,K/M,N,K/M} (Mm,Dc), {A/T/G/C,A/T/G/C} (Mm,Dc), {A/T/G/C,A/T/G/C,A/T/G/C} (Dc), {A/T/G/C,N,A/T/G/C} (Mm,Dc), {A/T/G/C,N,A/T/G/C,N,A/T/G/C} (Dc), {R,Y} (Mm), {R/Y,R/Y} (Mm), {R/Y,R/Y,R/Y} (Mm), {R/Y,R/Y,R/Y,R/Y} (Mm), {W/S,W/S} (Mm), {W/S,W/S,W/S} (Mm), {W/S,W/S,W/S,W/S} (Mm), {R/Y,N,R/Y} (Mm), {R/Y,N,R/Y,N,R/Y} (Mm), {R/Y,N,R/Y,N,R/Y,N,R/Y} (Mm), {W/S,N,W/S} (Mm), {W/S,N,W/S,N,W/S} (Mm), {W/S,N,W/S,N,W/S,N,W/S} (Mm)

Mm - mammals; Dc - dicots; Mc - monocots.
^aTaxa for which a feature is statistically significant are given in parentheses; no mark is given if the feature is significant for all taxa tested.
^bR=G/A; Y=T/C; M=A/C; K=G/T; W=A/T; S=G/C; B=not A; V=not T; H=not G; D=not C.

FIGURES

Figure 1. Scheme of the LEADER_RNA database.

(A)
CF SIGNIFICANT FEATURE
CT Translation INCREASES with DECREASING leader length
DP Length
PV Leader Length
AB 70.549 138.917
UT 1
LC 0.05
ST 67.111 (26.052) 11.1%
NT 149.400 (90.241) 42.5%
XX
C-CODE
double DiEx_Len (char *s){
double X; char *seq; int k, SiteLength=3;
seq=&s[0]; k=strlen(seq); if(k < SiteLength+1)return(-1001.);
X=(double)k;
X-=104.733; X*= -0.029; if(X < -1.)X=-1.; if (X > 1.)X=1.;
return(X);}

(B)
MI LIDER00Di
MN Dicot plants mRNA translational activity (structural features of 5'UTR)
KN KNOW00Di
OG Dicot plant high and low expression genes
OS Magnoliopsyda, dicot plant
OC Eukaryota
FF 5'untranslated region of mRNA
AN translational activity
AU Dichotomy: "High"=1 and "Low"=-1
PN Translational start
RN RF0001
SC AHR15BCOP;
SQ SEQUENCE LENGTH 47
catacatata actcaacttt gggaagccaa caagtacact aataaca
SA 1.
PA 48

(C)
MI Length
MN Structural
MD mRNA leader
ML Leader length
RN RF0003
PN Kolmogorov-Smirnov, Kendall TAU, Pearson LCC
PM Statistical
PV Training Test
PU Alpha
WW http://wwwmgs.bionet.nsc.ru/Programs/acts2/helps/tr_Table_1.htm
REASONS
It was shown that the length of the 5' untranslated sequence can......

(D)
RN RF0001
RA Joshi CP
RT An inspection of the domain between putative TATA box and translation
RT start site in 79 plant genes.
RJ Nucleic Acids Res
RV 15
RP 6643 6653
RY 1987
RR Over 75 published genomic DNA sequences from several higher plants have...

Figure 2. LEADER_RNA database domains: (A) LEAD_KNO (knowledge base), (B) LEAD_SEQ (database on 5’UTR nucleotide sequences), (C) LEAD_WHY (database on published experimental and statistical data concerning mRNA features influencing translatability), (D) LEAD_REF (reference database containing abstracts of related papers). For details, see Data analysis.
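The C-CODE field in panel (A) maps the leader length onto an estimate between -1 and 1. Its constants appear to be tied to the AB field of the same record: 104.733 is the midpoint of 70.549 and 138.917, and 0.029 is close to 2/(138.917 - 70.549). This reading of the AB field is an assumption, since the record itself does not state it. The following is a minimal commented sketch of the same calculation; the name leader_length_estimate and the two macros are hypothetical and are not part of LEADER_RNA.

```c
#include <assert.h>
#include <string.h>

/* Constants copied from the C-CODE field of the record in Figure 2(A). */
#define LEN_CENTER 104.733  /* numerically (70.549 + 138.917) / 2, cf. the AB field */
#define LEN_SLOPE  (-0.029) /* close to -2 / (138.917 - 70.549); negative because
                               shorter leaders are scored as better translated */

/* Hypothetical restatement of DiEx_Len: returns an estimate in [-1, 1]
   for a 5'UTR given as a plain nucleotide string without spaces. */
double leader_length_estimate(const char *utr)
{
    assert(utr != NULL);
    size_t n = strlen(utr);
    if (n < 4)                /* the record rejects leaders shorter than SiteLength + 1 */
        return -1001.0;       /* sentinel value used in the record */
    double x = ((double)n - LEN_CENTER) * LEN_SLOPE;
    if (x < -1.0) x = -1.0;   /* clamp to the [-1, 1] scale */
    if (x > 1.0)  x = 1.0;
    return x;
}
```

For the 87-nucleotide leader shown in Figure 3(A), this gives (87 - 104.733) × (-0.029) ≈ 0.514, which agrees with the estimate listed for criterion 1 in Figure 3(B).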

(A)
CTGAGAGTGT CCTGCCTGGT CCTCTGTGCC TGGTGGGGTG GGGGTGCCAG GTGTGTCCAG
AGGAGCCCAT TTGGTAGTGA GGCAGGT
(B)
Expert [Estimate*Weight=Decision]'s are following:
1. Translation INCREASES with DECREASING the Leader length 0.514257 * 5 =2.57129
2. Translation INCREASES with DECREASING [T] content 1 * 5 = 5
3. Translation INCREASES with DECREASING [AUG]:[-AUG] disbalance 1 * 5 = 5
4. Translation INCREASES with INCREASING [A]:[T] ratio 1 * 5 = 5
5. Translation INCREASES with INCREASING [AUG]:[-AUG] ratio 1 * 5 = 5
6. Translation INCREASES with DECREASING [A]:[T] disbalance 1 * 5 = 5
7. Translation INCREASES with DECREASING [AUG] content 1 * 5 = 5
8. Translation INCREASES with DECREASING [AUG] framed -1 * 5 = -5
9. Translation INCREASES with DECREASING [AUG] optimized 1 * 5 = 5
10. Translation INCREASES depends on the "-3 position" rule -1 * 5 = -5
11. Translation INCREASES with DECREASING [AUG] "-3"-ruled -1 * 5 = -5
12. Translation INCREASES with DECREASING [K] content of [-17;-1] -1 * 5 = -5
13. Translation INCREASES with DECREASING [KB] content of [-17;-1] -1 * 5 = -5
14. Translation INCREASES with INCREASING High-consensus matches -1 * 5 = -5
15. Translation INCREASES with DECREASING Low-consensus matches -1 * 5 = -5
16. Translation INCREASES with INCREASING High-ShortFreqMatr -0.777347 * 5 =-3.88673
17. Translation INCREASES with INCREASING High/Low 1bp(ATGC)-FreqRatio -1 * 5 = -5
18. Translation INCREASES with INCREASING High/Low 1bp(KM)-FreqRatio -1 * 5= -5
19. Translation INCREASES with INCREASING High/Low 2bp(KM)-FreqRatio -1 * 5= -5
20. Translation INCREASES with INCREASING High/Low 3bp(KxM)-FreqRatio -1 * 5= -5
21. Translation INCREASES with INCREASING High/Low 5bp(KM)-FreqRatio -1 * 5= -5
22. Translation INCREASES with INCREASING High/Low 6bp(KM)-FreqRatio -1 * 5= -5
23. Translation INCREASES with INCREASING High/Low 3bp(ATGCx)FreqRatio -1* 5 = -5
24. Translation INCREASES with INCREASING High/Low 5bp(ATGCx)FreqRatio -1* 5 = -5
25. Translation INCREASES with INCREASING High/Low 3bp(KxM)-FreqRatio -1 * 5= -5
26. Translation INCREASES with INCREASING High/Low 5bp(KxM)-FreqRatio -1 * 5= -5
27. Translation INCREASES with INCREASING High/Low 7bp(KxM)-FreqRatio -1 * 5= -5
Prediction is mean of the Expert decisions = -0.417151 that means Low.

Figure 3. Predicted translational efficiency of the human cytochrome P450 mRNA in dicot plant cells. (A) 5’UTR sequence, (B) discriminative criteria. The weight coefficient is 5 for all criteria.
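The final value in panel (B) can be reproduced from the listed numbers: the sum of the 27 decisions divided by the sum of the weights is -0.417151, which with equal weights is simply the mean of the estimates. The sketch below illustrates that arithmetic only. The function names mean_decision and classify are hypothetical, and the zero threshold between "High" and "Low" is an assumption suggested by the High = 1 / Low = -1 coding in Figure 2(B), not a rule stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: weighted mean of per-criterion estimates,
   i.e. sum(estimate * weight) / sum(weight). */
double mean_decision(const double *estimate, const double *weight, size_t n)
{
    assert(estimate != NULL && weight != NULL);
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        num += estimate[i] * weight[i];  /* one "decision" per criterion */
        den += weight[i];
    }
    return den > 0.0 ? num / den : 0.0;  /* no criteria: neutral score */
}

/* Assumed rule: non-negative scores are called "High", negative ones "Low". */
const char *classify(double score)
{
    return score >= 0.0 ? "High" : "Low";
}
```

Called with the 27 estimates of panel (B) and a weight of 5 for each, mean_decision returns about -0.4172, and classify labels that score "Low", as in the program output.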
