Eukaryotic mRNAs encoding abundant and scarce proteins are statistically dissimilar in many structural features
Alex V. Kochetov (a), Igor V. Ischenko (a), Denis G. Vorobiev (a), Alexander E. Kel (a), Vladimir N. Babenko (a), Lev L. Kisselev (b), and Nikolay A. Kolchanov (a)*
(a) Institute of Cytology and Genetics, Pr. Lavrentieva 10, Novosibirsk, 630090 Russia
(b) Engelhardt Institute of Molecular Biology, Moscow, 117984 Russia
ABSTRACT Structural and contextual features have been found to differ in the 5’ untranslated regions (5’ UTR) of eukaryotic mRNAs encoding high- and low-abundance proteins. Statistically, 5’UTRs of low-expression mRNAs are longer, have a higher guanine plus cytosine content, have a less optimal context of the translation initiation codon of the main open reading frame, and more frequently contain upstream AUGs than 5’UTRs of high-expression mRNAs. Apart from the differences in 5’UTRs, high-expression mRNAs contain stronger termination signals. Structural features of low- and high-expression mRNAs are likely to contribute to the translational efficiency and to the yield of their protein products.
Key words: Eukaryotic mRNAs; Translational efficiency; 5’ UTR; Context effects; Statistical analysis
*Corresponding author: e-mail: firstname.lastname@example.org; Fax: 7-(3832)-331278; Phone: 7-(3832)-333468
Abbreviations: Exp, expected; H-mRNA and L-mRNA, eukaryotic mRNAs encoding highly abundant and scarce proteins, respectively; Obs, observed; UTR, untranslated region of mRNA.
Translational efficiency is known to differ considerably among eukaryotic mRNAs. Contextual and structural features of the 5’ untranslated region (5’ UTR) significantly affect the initiation of translation [1, 2]. The productive recognition of the AUG codon as an initiator depends on its nucleotide context. Adenine at position -3 relative to AUG and, to a lesser extent, guanine at position +4 provide the optimal context for translation initiation in mammalian cells, while U or C at position -3 decreases initiation efficiency. If a protein-coding sequence does not start from the first upstream AUG codon, the preceding AUG codon(s) in the 5’UTR are usually in a non-optimal context. Some of the 40S ribosomal subunits recognize these upstream AUGs and eventually initiate translation upstream from the genuine initiation site. The secondary structure of 5’UTRs may also affect translational initiation. Hairpins have negative effects on the migration of 40S ribosomal subunits along mRNA. The effect of a hairpin on eukaryotic mRNA translation in vivo depends on hairpin stability and location within the leader (reviewed in [2, 4]). Translation of mRNAs containing higher-order structures within 5’UTRs may also be affected by translation initiation factors (reviewed in ).
Some mRNAs containing AUGs and stable secondary structure elements within their 5’UTRs may be efficiently translated through binding of the ribosomes to an internal ribosome entry site (IRES) in the 5’UTR, as was shown, for example, for some picornaviral and some eukaryotic mRNAs. Although ribosomes can bind to an IRES without scanning the folded 5’UTR segments, most eukaryotic mRNAs are translated by linear 5’UTR scanning (reviewed in ).
The primary structure of the mRNA coding regions may also affect translational efficiency. In prokaryotes and some eukaryotes, genes encoding proteins of high and of low abundance show preferences in codon usage (reviewed in [8, 9]). The distribution of translating ribosomes along mRNA may be non-uniform because the local secondary structure of mRNA may affect ribosomal movement.
Statistical analysis of translation stop signals for various eukaryotic taxa has shown a non-random distribution of nucleotides at the two positions immediately upstream of the stop codons [11-13]. Analysis of the 5’ context of termination codons in humans has demonstrated that U is over-represented at position 3 upstream from UAG. At the last sense codon position, the following are over-represented: UUU (Phe), AGC (Ser), and the Lys and Ala codon families before UGA; AAG (Lys), GCG (Ala), and the Ser and Leu codon families before UAA; UCA (Ser), AUG (Met), and the Phe codon family before UAG. Thr and Gly are under-represented before UGA and UAA, respectively. Collectively, the results demonstrate that the 5’ contexts of termination codons in E. coli and higher eukaryotes are similar.
As for the 3’ contexts of the stop codons, eukaryotic taxa exhibit a bias at nucleotide position +1 (immediately downstream of the stop codon): the frequency of purine bases is high and that of C is low [11,12]. The important role of the base adjacent to the stop codon was confirmed by in vivo experiments on readthrough of the internal UGA in human mRNA encoding iodothyronine deiodinase, and also in vitro with a set of tetraplets containing stop codons.
The mRNAs of eukaryotic genes with contrasting expression levels may differ in contextual features and structural organisation. Here we compare the mRNA structural features of several groups of house-keeping genes highly expressed in eukaryotic cells and of regulatory genes whose expression is low and under stringent control.
2. MATERIALS AND METHODS
The programs were written in Borland C and run on an IBM/PC Pentium-100. The computer program MGL was used for handling mRNA databases. Statistical parameters were calculated using the STATISTICA package (StatSoft™).
mRNA sequences were taken from the EMBL database, release 49. The coding sequences and the 5’ UTRs of mRNAs were analysed. Redundant sequences were eliminated from all data sets. The set of 3’ UTR sequences was also analysed, using the same EMBL entries as for the 5’ UTR sequences of mammalian mRNAs.
Selection of mRNAs for sets of H-mRNA and L-mRNA. We analysed 404 H-mRNAs encoding abundant eukaryotic proteins from the following families: translation elongation factor 1 alpha (eEF1a) and ribosomal proteins, actins, tubulins, 70-kDa heat shock proteins, myosins, and histones. All these polypeptides are essential for cell viability and are synthesised in eukaryotic cells in considerable amounts. For example, eEF1a ranks second after actins, comprising up to 2% of the total cell protein. Heat strongly induces expression of the heat shock genes, leading to a 1000-fold increase in the mRNA content and intense synthesis of heat shock proteins.
Some eukaryotic polypeptides are encoded by gene families, and the contribution of individual family members to the synthesis of an abundant protein may vary. Therefore, the H-mRNA set may include some sequences that contribute little to the protein yield. However, the data available so far are insufficient to compose a representative set of mRNAs known to be efficiently translated.
L-mRNAs encoding rare proteins were represented by 323 sequences, including mRNAs for interferons, interleukins, growth factors, receptors, transcription factors, proteins encoded by oncogenes and tumor suppressor genes, and other regulatory proteins. Expression of these genes is known to be under stringent control not only at the transcriptional level, but also through decreased stability of mRNAs and proteins. To select mRNAs encoding transcription factors, the TRANSFAC database was employed.
For analysis of 5’UTR lengths, only the subsets of H- and L-mRNAs with full-sized 5’UTRs whose 5’ ends had been mapped experimentally were considered. In all other cases, full-sized and possibly truncated 5’UTRs were analysed together.
Context analysis of translational start and stop codons. The average nucleotide frequencies in the 5’UTR mRNA sets were taken as Exp for the AUG codon context. Similarly, the average nucleotide frequencies in the 3’UTRs of mRNAs were used as Exp for analysis of the termination codon (UAA, UAG and UGA) contexts. UAA-, UAG- and UGA-containing 3’UTR subsets were processed separately. For each of the four nucleotides, the significance of the deviation of the Obs frequency at a given position from the Exp frequency was calculated using the formula: χ² = (Obs - Exp)² / Exp. The four χ² values for each position were summed to estimate the total deviation at this position (with three degrees of freedom). This approach was used earlier to compare the context features of translation termination codons [11, 14, 15].
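The per-position calculation described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' Borland C program: the function name `position_chi2`, the use of counts (rather than frequencies) for Obs, and the toy numbers are all assumptions made here for illustration.

```python
def position_chi2(obs_counts, exp_freqs):
    """Sum (Obs - Exp)^2 / Exp over A, C, G, U for one context position.

    obs_counts: nucleotide -> observed count at this position (assumed input form).
    exp_freqs:  nucleotide -> expected frequency, e.g. the average
                composition of the corresponding UTR set (sums to 1).
    """
    total = sum(obs_counts.values())
    chi2 = 0.0
    for nt in "ACGU":
        expected = exp_freqs[nt] * total  # expected count at this position
        chi2 += (obs_counts.get(nt, 0) - expected) ** 2 / expected
    return chi2


# Critical chi-square value for 3 degrees of freedom at P = 0.001.
CRITICAL_DF3_P001 = 16.27

# Hypothetical counts at one position; these are NOT data from the paper.
obs_minus3 = {"A": 60, "C": 10, "G": 25, "U": 5}
exp_utr = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

chi2_minus3 = position_chi2(obs_minus3, exp_utr)
significant = chi2_minus3 > CRITICAL_DF3_P001
print(chi2_minus3, significant)
```

Summing over the four nucleotides gives one value per position, which is then compared with the χ² distribution with three degrees of freedom.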
H- and L-mRNAs were compared using the χ² test as a criterion of homogeneity in a 2×4 table. The total deviation for the four nucleotides was calculated (3 degrees of freedom).
The distributions of 5’UTR lengths and the nucleotide content of the H- and L-mRNA sets were compared using the Kolmogorov-Smirnov test.
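A homogeneity test on a 2×4 table of nucleotide counts can be written as a standard Pearson χ² computation. The sketch below is a generic textbook version under the assumption that the rows are H- and L-mRNA counts of A, C, G and U at one position; the function name and the counts are hypothetical.

```python
def homogeneity_chi2(row_h, row_l):
    """Pearson chi-square for a 2 x k contingency table (df = k - 1)."""
    n_h, n_l = sum(row_h), sum(row_l)
    n = n_h + n_l
    chi2 = 0.0
    for count_h, count_l in zip(row_h, row_l):
        column_total = count_h + count_l
        if column_total == 0:
            continue  # an empty column contributes nothing
        for observed, row_total in ((count_h, n_h), (count_l, n_l)):
            expected = row_total * column_total / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts of A, C, G, U at one position; not data from the paper.
h_counts = [60, 10, 25, 5]
l_counts = [40, 20, 25, 15]
chi2_hl = homogeneity_chi2(h_counts, l_counts)
print(round(chi2_hl, 3))
```

With four nucleotide columns the statistic has three degrees of freedom, matching the text.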
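The two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between the two empirical distribution functions. The sketch below computes only that statistic D, in plain Python; the paper used the STATISTICA package, so this is a stand-in for illustration, and the length values are invented.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical 5'UTR lengths in nucleotides; not data from the paper.
h_lengths = [30, 50, 70, 90, 120]
l_lengths = [80, 150, 200, 300, 400]
d_stat = ks_statistic(h_lengths, l_lengths)
print(d_stat)
```

Converting D into a P value requires the Kolmogorov distribution (or a statistics library), which is omitted here.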
Length of the 5’UTRs of H- and L-mRNAs
We compared the lengths of the 5’UTRs of H- and L-mRNAs with experimentally mapped 5’ ends; these sets contained 145 and 56 sequences, respectively. The mean length of the 5’UTR of L-mRNAs (248 nucleotides) considerably exceeded that of H-mRNAs (85 nucleotides). 5’UTRs were shorter than 100 nucleotides in 70% of H-mRNAs and in 30% of L-mRNAs (Fig. 1). According to the Kolmogorov-Smirnov test, the difference between the two sets was significant (P<0.001).
Nucleotide composition of H- and L-mRNAs
Since guanine and cytosine contribute significantly to the stability of RNA secondary structure, we analysed the G+C content in mRNA sets for various taxa. The G+C content in 5’UTRs was higher for the warm-blooded vertebrates, as reported earlier; it was 4-6% higher for L-mRNAs than for H-mRNAs of the same taxon (Table 1). The difference between H- and L-mRNAs for the arthropods is smaller, either due to peculiarities of the gene set available so far for arthropods or due to their genuine species-specific features. The difference in the G+C content of H- and L-mRNA 5’UTRs is significant (P<0.001).
No significant difference in the G+C content was found between the coding regions of the H- and L-mRNA sets. The observed differences (see Table 1) may be explained by species specificity. These data are in good agreement with the isochore structure of genomes and the G+C enrichment of the genomic DNAs of warm-blooded vertebrates. The G+C content in the protein-coding regions of H- and L-mRNAs may be related to their location in isochores.
GC pairs have a major impact on hairpin stability, and therefore sequences containing more G and C have the potential to form more stable secondary structures. However, if a sequence contains unequal amounts of complementary nucleotides, its ability to form stable secondary structure is reduced. G/C and A/U ratios were determined for mammalian H- and L-mRNA 5’UTRs. The contents of complementary nucleotides proved to be considerably more asymmetric in the H-mRNA 5’UTRs. For example, the G/C ratio was close to 1 (0.75<G/C<1.25) in 31.4% of H-mRNA leaders and in 42.3% of L-mRNA leaders of Mammalia. Similarly, the A/U ratio was close to 1 in 21.6% of H- and 48.5% of L-mRNAs. The frequency of 5’UTRs with a highly asymmetric content of complementary nucleotides was about two times higher for H-mRNAs than for L-mRNAs. The frequency of nucleotide sequences with similar contents of G and C, and of A and U, was found to be significantly higher in 3’UTRs than in 5’UTRs; the difference between the 3’UTRs of H- and L-mRNAs was smaller than for the 5’UTRs (Table 2). Since sequences with equal content of complementary nucleotides were rather rare in the 5’UTRs of H-mRNAs, one may suggest that H-mRNAs have a weaker ability to form stable secondary structures in the 5’UTR than L-mRNAs.
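The "ratio close to 1" criterion (0.75 < G/C < 1.25, and the same window assumed here for A/U) can be expressed as a simple classifier over a set of leaders. The following Python sketch uses hypothetical function names and toy sequences; how the original analysis treated leaders lacking one of the two nucleotides is not stated, so such leaders are counted as unbalanced here.

```python
def is_balanced(seq, x, y, low=0.75, high=1.25):
    """True if the count ratio x/y in seq lies strictly between low and high."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style database entries
    n_x, n_y = seq.count(x), seq.count(y)
    if n_y == 0:
        return False  # assumption: an undefined ratio counts as unbalanced
    return low < n_x / n_y < high


def balanced_percent(utrs, x, y):
    """Percentage of sequences whose x/y ratio is close to 1."""
    return 100.0 * sum(is_balanced(s, x, y) for s in utrs) / len(utrs)


# Toy leaders; not sequences from the paper.
toy_utrs = ["GGGCCCAAAAUU", "GCGCAUAU", "GGGGCAAU", "GCCAUUUU"]
gc_balanced = balanced_percent(toy_utrs, "G", "C")
au_balanced = balanced_percent(toy_utrs, "A", "U")
print(gc_balanced, au_balanced)
```

Applied to real H- and L-mRNA leader sets, the two percentages correspond to the kind of figures quoted in the text.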
Contexts of the initiator AUG codons
The contexts of the initiator AUG codons of the mammalian H- and L-mRNAs were different (Table 3). C or U was found at position -3 relative to AUG in 23.2% of L-mRNAs (11.6% C and 11.6% U) and in 4.35% of H-mRNAs (3.48% C and 0.87% U). It is known that the nucleotide at position -3 has a major influence on the efficiency of translation initiation; the highest efficiency was shown for A. In this study, A at position -3 was 1.5 times more frequent in H-mRNAs (59.1%) than in L-mRNAs (40.4%). Therefore, the context of the translation initiation codon should be less optimal in L-mRNAs. In general, the AUG codon context of H-mRNAs is closer to the consensus sequence (GCC)GCCA/GCCAUGG typical for vertebrate mRNAs. The deviations of Obs nucleotide frequencies from Exp in the AUG codon contexts are considerably higher in H-mRNAs than in L-mRNAs. We suggest that a more optimal context of the initiation codon might be essential for the higher translational efficiency of H-mRNAs.
AUG codons in 5’UTRs of H- and L-mRNAs
AUG codon frequencies in 5’UTRs were found to differ significantly (P<0.001) between the sets of H- and L-mRNAs. AUG codons were found in 40 out of 295 H-mRNA leaders and in 112 out of 285 L-mRNA leaders. Hence, the proportion of AUG-containing 5’UTRs was about 3 times higher in L-mRNAs (39.3%) than in H-mRNAs (13.6%). In 16 out of 40 AUG-containing 5’UTRs of H-mRNAs, the AUG codons were in a non-optimal context, i.e. with C or U at position -3 and a base other than G at position +4. As to L-mRNAs, 27 out of 112 AUG-containing 5’UTRs contained only non-optimal AUG codons. Leaders with the optimal AUG codon context were more frequent among L-mRNAs (18 out of 112, compared to 4 out of 40 for the 5’UTRs of H-mRNAs).
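The counts quoted above can be checked with a short calculation. The text does not state which test produced P<0.001 for this comparison; the Python sketch below assumes an ordinary 2×2 χ² test (one degree of freedom, no continuity correction) purely as a plausibility check, with a hypothetical helper name.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]] (df = 1)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts taken from the text: leaders with at least one upstream AUG.
h_with_aug, h_total = 40, 295
l_with_aug, l_total = 112, 285

h_fraction = h_with_aug / h_total          # about 0.136
l_fraction = l_with_aug / l_total          # about 0.393
fold_difference = l_fraction / h_fraction  # about 2.9

chi2_aug = chi2_2x2(h_with_aug, h_total - h_with_aug,
                    l_with_aug, l_total - l_with_aug)

# Critical chi-square value for 1 degree of freedom at P = 0.001.
CRITICAL_DF1_P001 = 10.83
print(round(fold_difference, 2), round(chi2_aug, 1), chi2_aug > CRITICAL_DF1_P001)
```

Under this assumed test the statistic is far above the P = 0.001 threshold, which is consistent with the significance level reported in the text.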
The increased content of AUGs in 5’UTRs of L-mRNAs may be related to their greater length (Fig. 1). To test this, we calculated the AUG frequencies in 5’UTRs of mammalian H- and L-mRNAs normalised to the respective 5’UTR lengths. Exp AUG frequencies were calculated according to the formula P(AUG) = P(A) · P(U) · P(G), where P(X) is the Exp content of nucleotide X in the 5’UTRs of H- or L-mRNAs. The ratio of Obs to Exp AUG frequencies in 5’UTRs of L-mRNAs was significantly higher than in 5’UTRs of H-mRNAs (0.514 and 0.326, respectively). Therefore, the greater length of 5’UTRs was not the major reason for the differences found between H-mRNAs and L-mRNAs. In contrast, the ratios of Obs to Exp AUG frequencies in 3’UTRs of mammalian H- and L-mRNAs were virtually identical (0.93 and 0.94, respectively).
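The Obs/Exp ratio for AUG can be sketched as follows. The exact normalisation used in the original analysis is not described, so this Python sketch makes explicit assumptions: nucleotide contents are pooled over the whole set, and the observed frequency is the number of AUG triplets divided by the number of trinucleotide windows. The function name and the toy sequences are hypothetical.

```python
def aug_obs_exp_ratio(utrs):
    """Ratio of observed to expected AUG frequency over a set of UTRs.

    Expected frequency: P(A) * P(U) * P(G) from pooled nucleotide contents.
    Observed frequency: AUG count / number of 3-nt windows (an assumption).
    """
    seqs = [s.upper().replace("T", "U") for s in utrs]
    total_nt = sum(len(s) for s in seqs)
    p = {nt: sum(s.count(nt) for s in seqs) / total_nt for nt in "AUG"}
    expected = p["A"] * p["U"] * p["G"]
    windows = sum(max(len(s) - 2, 0) for s in seqs)
    # AUG cannot overlap itself, so str.count finds every occurrence.
    observed = sum(s.count("AUG") for s in seqs) / windows
    return observed / expected


# Toy sequences; not data from the paper.
toy_ratio = aug_obs_exp_ratio(["AUGC", "CCGA"])
print(toy_ratio)
```

A ratio below 1, as reported for both 5’UTR sets, means that AUG occurs less often than the nucleotide composition alone would predict.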
Termination signals in H- and L-mRNAs
The frequencies of the three termination codons in the H- and L-mRNA sets were calculated for mammalian mRNAs. The order of frequencies is UAA (43.9%) > UGA (36.3%) > UAG (19.8%) for H-mRNAs and UGA (54.8%) > UAA (28.7%) > UAG (16.5%) for L-mRNAs. The UAA codon provides the highest efficiency of translation termination in mammalian cells and in an in vitro system.
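Tallying stop codon usage over a set of coding sequences is straightforward. The Python sketch below assumes that each coding sequence is supplied with its stop codon as the last three nucleotides; the function name and the example sequences are hypothetical.

```python
from collections import Counter

STOP_CODONS = ("UAA", "UAG", "UGA")


def stop_codon_usage(coding_sequences):
    """Percentage of sequences ending in each of the three stop codons."""
    counts = Counter()
    for cds in coding_sequences:
        stop = cds.upper().replace("T", "U")[-3:]
        if stop in STOP_CODONS:
            counts[stop] += 1  # sequences without a terminal stop are skipped
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 100.0 * counts[codon] / total for codon in STOP_CODONS}


# Toy coding sequences; not data from the paper.
usage = stop_codon_usage(["AUGAAAUAA", "AUGCCCUGA", "ATGGGGTAA", "AUGUUUUAG"])
print(usage)
```

Running the same tally separately on the H- and L-mRNA sets gives two percentage profiles that can be compared directly.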
We also analysed the Obs nucleotide frequencies at positions around each of the three termination codons of mammalian mRNAs (Table 4). Strong deviations of Obs from Exp nucleotide frequencies were detected at several positions. Notably, the frequency of A and G at position +4 downstream of the stop codons is considerably lower in L-mRNAs than in H-mRNAs. It was noted earlier that the presence of a purine base in this position strongly enhanced translational termination efficiency in mammalian ce