LINEAR-ADDITIVE APPROXIMATION FOR PREDICTING ACTIVITY OF THE FUNCTIONAL DNA AND RNA SITES
N.A. Kolchanov, M.P. Ponomarenko, J.V. Ponomarenko, A.S. Frolov, N.L. Podkolodny#
Institute of Cytology and Genetics, Novosibirsk, Russia, 630090; kol@bionet.nsc.ru; #)Institute of Computational Mathematics and Mathematical Geophysics, Novosibirsk, Russia, 630090
ABSTRACT
We suggest a linear-additive approximation for predicting the activity of functional DNA and RNA sites. Its novelties are (i) taking into account the physico-chemical and conformational properties of DNA and RNA and (ii) generating and testing, one by one, a huge body of hypotheses such as “Is a given DNA/RNA feature responsible for a definite site activity?”. The considered features are calculated from conformational and physico-chemical properties and from the oligonucleotide content of a given site. Such features are easy to interpret but, of course, hard to guess. That is why we suggest generating and testing for significance as many of these features as the computer can afford. This linear-additive approximation has been implemented to create the distributed and intelligent database ACTIVITY for functional site activities. Currently, about 60 experiments are described in this database. In addition, the database compiles 40 conformational and physico-chemical properties. There is also a knowledge base of the features found significant for predicting the site activities. It is linked to a library of the computer programs implementing these predictions. ACTIVITY is available on the WWW in real-time mode at http://www.bionet.nsc.ru/SRCG/Activity/.
INTRODUCTION
As is known, every molecular process in a cell, such as replication, transcription, splicing, and translation, is controlled by a definite set of functional sites. Thousands of such sites are presently known. Each site has a certain location and activity value. The methods for recognizing sites rely on the data on site location stored in the EMBL Data Library, the GenBank Database, and other compilations [1-4]. There are hundreds of methods in this area of intense research (for review, see [5]). Of these, the consensus [6], neural network [7], and weight matrix [8] methods are widely used. The methods recognizing site locations are applied to annotate genomic DNA [9-12]. Functional sites of the same type can differ in activity by several orders of magnitude. Table 1 shows that synthetic ssDNAs differ 350-fold in affinity for the RecA filament [13], mutant E. coli operators vary by two orders of magnitude in affinity for the Cro repressor [14], and so on [15-17].
Mulligan et al. [18] were the first to predict the activity Kbk2 of E. coli promoters from a homology score. Using multiple regression, Stormo et al. [19] optimized weight matrices to predict the site activities for su2-suppression and 2-aminopurine induced mutations. Berg and von Hippel [20, 21] generalized these data within the framework of a statistical-mechanical theory and applied it to predict the activities of the CRP- and Cro-binding sites and of E. coli promoters. As for RNA, a weight matrix predicting the activities of the E. coli ribosome-binding site has also been optimized [22]. Jonsson et al. [23] introduced neural networks to predict the E. coli promoter strength. Neural networks have also made it possible to predict the activities of the INR-binding site and TATA box [24]. A system generating programs to predict site activities has been created [20] and applied to predict the consensus site maximizing the affinity between DNA and the TBP protein [25]. This consensus was found to be similar to Bucher’s consensus [1] of the TATA box.
This large body of data on site activity appears to be as informative for developing prediction methods as the widely used data on site locations. Therefore, the prediction of site activity remains a challenging problem of computational biology. However, there are no databases of site activities and, thus, no information sources for predicting them. Nevertheless, there are hundreds of “sequence → activity” experimental data sets. With the introduction of the SRS query language [27], compiling the data on site activities became feasible.
With this background, we developed the distributed and intelligent database ACTIVITY for the activities of the functional sites in DNA and RNA. It has the following: (i) the database for the experimental data on site sequences with known activities, (ii) the database for the conformational and physico-chemical properties of DNA and RNA, (iii) the knowledge base for the features significant for predicting the site activities, and (iv) the library for the programs predicting the site activities. ACTIVITY is WWW-available in “real-time” mode (http://www.bionet.nsc.ru/SRCG/Activity/).
METHOD
In this paper, we suggest a linear-additive approximation for predicting the activities of functional DNA and RNA sites. Its biological novelty is taking into account the physico-chemical and conformational properties of DNA in addition to the widely used contextual ones. Its computational novelty is the use of decision making theory [28] and Zadeh’s fuzzy logic [29] to generate and test a huge body of hypotheses such as “Is this DNA property responsible for a given site activity?”. Fig.1 demonstrates that (a) the concentration of the tetranucleotide VUKK within the SV40 pre-mRNA cleavage point is responsible for the 3’processing efficiency, and (b) the major groove width of DNA is responsible for the Cro/DNA affinity. DNA features of this type are easy to interpret but hard to guess.
That is why we suggest generating and testing as many DNA features as the computer can afford.
The core idea of the linear-additive approximation is that the site activity, F, is determined by the simultaneous action of site features, X, of two types: obligatory and facultative. The obligatory features of a given site are invariant across all sequences of this site and determine its basal activity. The consensus is a typical obligatory feature. The facultative features of a given site vary in “number, size and location” from one sequence of the site to another. They modulate the site activity relative to the basal level. Hence, within the framework of the linear-additive approximation, the activity of the site with sequence S can be calculated by the following equation:
$$F(S) = F_0(S) + \sum_{k=1}^{K} F_k \, X_k(S), \qquad (1)$$
where F0(S) equals the basal activity of sites of this type when the sequence S has their obligatory features and equals “0” otherwise; Xk(S) are the values of the facultative features of the sequence S; Fk are the contributions of the facultative features Xk to the site activity F; and K is the number of facultative features.
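As a minimal sketch of formula (1), the function below sums the basal activity and the weighted facultative features. The function name and argument layout are illustrative assumptions, not part of ACTIVITY.

```c
#include <stddef.h>

/* Linear-additive approximation, formula (1):
   F(S) = F0(S) + sum over k = 1..K of Fk * Xk(S).
   f0 is the basal activity, fk the contributions, xk the feature values. */
double linear_additive(double f0, const double *fk, const double *xk, size_t K)
{
    double f = f0;
    for (size_t k = 0; k < K; k++)
        f += fk[k] * xk[k];
    return f;
}
```

With the E. coli promoter coefficients of Table 4 (F0 = 0.3, F1 = 0.6, F2 = 0.0008) and hypothetical feature values X1 = 1.0 and X2 = 100.0, the call returns approximately 0.98.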
Three types of facultative features are considered, namely, statistical, conformational, and physico-chemical. The weighted concentration of oligonucleotides Z={z1, ..., zj, ..., zm} of length m is used as a statistical feature of the nucleotide context of the sequence S={s1, ..., si, ..., sL} of length L:
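$$X_{Z,m,w}(S) = \sum_{i=1}^{L-m+1} w(i)\,\delta_Z(s_i \ldots s_{i+m-1}), \qquad (2)$$
here, the so-called “δ-function” is used, which takes the value "1" or "0" at each position i of the sequence S depending on the presence or absence of the oligonucleotide Z at this position:
$$\delta_Z(s_i \ldots s_{i+m-1}) = \begin{cases} 1, & \text{if } s_{i+j-1} \in z_j \text{ for all } 1 \le j \le m; \\ 0, & \text{otherwise}, \end{cases}$$
where: si ∈ {A, T, G, C}; zj ∈ {A, T, G, C, W=A/T, R=A/G, M=A/C, K=T/G, Y=T/C, S=G/C, B=T/G/C, V=A/G/C, H=A/T/C, D=A/T/G, N=A/T/G/C}; m<<L.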
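The weighted concentration of formula (2) can be sketched as follows. This is an illustration under assumptions: it uses the un-normalized sum that appears in the generated C code of Fig. 7, it treats U as T so that RNA oligonucleotides such as VUKK can be matched, and all function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Bases allowed by each single-letter code listed in the text (U is read as T). */
static const char *iupac_bases(char code)
{
    switch (code) {
    case 'A': return "A";   case 'T': case 'U': return "T";
    case 'G': return "G";   case 'C': return "C";
    case 'W': return "AT";  case 'R': return "AG";  case 'M': return "AC";
    case 'K': return "TG";  case 'Y': return "TC";  case 'S': return "GC";
    case 'B': return "TGC"; case 'V': return "AGC"; case 'H': return "ATC";
    case 'D': return "ATG"; case 'N': return "ATGC";
    default:  return "";
    }
}

/* 1 if the sequence base is allowed by the code, else 0. */
static int base_matches(char code, char base)
{
    if (base == 'U') base = 'T';
    if (base == '\0') return 0;
    return strchr(iupac_bases(code), base) != NULL;
}

/* Delta-function: 1 if oligonucleotide z (length m) occurs at position i of s. */
static int delta(const char *s, size_t i, const char *z, size_t m)
{
    for (size_t j = 0; j < m; j++)
        if (!base_matches(z[j], s[i + j]))
            return 0;
    return 1;
}

/* Sum of w[i] over all positions i where z occurs in s.
   w must hold at least L - m + 1 weights; returns 0 if z is empty or longer than s. */
double weighted_concentration(const char *s, const char *z, const double *w)
{
    size_t L = strlen(s), m = strlen(z);
    double x = 0.0;
    if (m == 0 || m > L) return 0.0;
    for (size_t i = 0; i + m <= L; i++)
        if (delta(s, i, z, m))
            x += w[i];
    return x;
}
```

For example, in the sequence AGAAC the dinucleotide AS (A followed by G or C) occurs at the first and fourth positions, so with weights 1, 2, 3, 4 the feature value is 1 + 4 = 5.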
The basic element of the facultative feature description is the position-effect function w(i). It accounts for the fact that the same oligonucleotide contributes differently to the site activity depending on its location. The function w(i) is determined by a simple rule: the more important the position is for the site function, the higher is its assigned weight w(i). In total, 180 weight functions w(i) are used in the activity prediction. The weight functions given in Fig.2 model the highest effect on the site activity of (a) a narrow region within the right half of the site, (b) its central part, (c) its terminal regions, and (d) the left terminus of the site.
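The 180 weight functions themselves are not listed in the text, so the following is only a hypothetical example of what a position-effect function could look like: a triangular profile that peaks at a chosen position and falls to a floor of 0.1, the smallest weight visible in the Fig. 7 listing. The shape, names, and parameters are all assumptions.

```c
/* Hypothetical position-effect function: weight 1.0 at `center`,
   decreasing linearly to a floor of 0.1 at distance `half_width` and beyond. */
double triangular_weight(int i, int center, int half_width)
{
    const double floor_w = 0.1;
    int d = i > center ? i - center : center - i;
    if (half_width <= 0 || d >= half_width)
        return floor_w;
    return 1.0 - (1.0 - floor_w) * (double)d / (double)half_width;
}

/* Fill w[0..len-1] with weights for site positions first, first+1, ... */
void fill_weights(double *w, int len, int first, int center, int half_width)
{
    for (int j = 0; j < len; j++)
        w[j] = triangular_weight(first + j, center, half_width);
}
```

A profile centered in the right half of the site would play the role of Fig.2a, and one centered in the middle the role of Fig.2b.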
As is known, local conformational DNA heterogeneities dependent on the nucleotide context play an important role in DNA-protein interactions, which essentially determine the site activity. That is why the prediction of site activity takes into account the DNA conformational properties describing the mutual orientation and locations of base pairs. The values of these parameters averaged for the known X-ray structures are used. Also the following physico-chemical properties are used: the melting temperature, persistent length, entropy, and others. These properties determine the conformational dynamics of DNA sites during their functioning. About 40 conformational and physico-chemical properties are utilized in prediction of the site activities.
Thus, the sequence of the site S can be characterized by the mean value of the q-th conformational or physico-chemical property Pq averaged for the region between positions a and b:
$$X_{q,a,b}(S) = \frac{1}{b-a} \sum_{i=a}^{b-1} P_q(s_i s_{i+1}). \qquad (3)$$
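The regional average of formula (3) can be sketched as below for properties defined per dinucleotide step, as in the Direction_for_EcPbla listing of Fig. 7. The table values are copied from that listing; the index order A, T, G, C is inferred from its column comments, and the function names and 0-based positions are assumptions made for this example.

```c
#include <stddef.h>

/* Index of a base in the order A, T, G, C; -1 for any other character. */
static int base_index(char b)
{
    switch (b) {
    case 'A': return 0; case 'T': return 1;
    case 'G': return 2; case 'C': return 3;
    default:  return -1;
    }
}

/* "Direction" values per dinucleotide step, copied from the Fig. 7 listing. */
static const double DIRECTION[16] = {
    /* AA      AT     AG     AC     TA     TT     TG     TC */
    -154.0,    0.0,   2.0, 143.0,   0.0, 154.0,  64.0, -120.0,
    /* GA      GT     GG     GC     CA     CT     CG     CC */
     120.0, -143.0,  57.0, 180.0, -64.0,  -2.0,   0.0,  -57.0
};

/* Mean of a dinucleotide property over the steps between 0-based positions
   a and b (a < b, both inside s). Stores the mean in *out and returns 0,
   or returns -1 on an empty region or a non-ATGC character. */
int mean_property(const char *s, size_t a, size_t b,
                  const double table[16], double *out)
{
    double sum = 0.0;
    if (a >= b) return -1;
    for (size_t i = a; i < b; i++) {
        int first = base_index(s[i]);
        int second = first < 0 ? -1 : base_index(s[i + 1]);
        if (first < 0 || second < 0) return -1;
        sum += table[4 * first + second];
    }
    *out = sum / (double)(b - a);
    return 0;
}
```

For the sequence AAC and the region (0, 2), the two steps AA and AC give (-154 + 143) / 2 = -5.5.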
It should be emphasized that before starting the analysis, we knew very little about the statistical, physico-chemical, or conformational features that were most important for the activity of the examined site. The only available data were the sequences with the known activities. With this in mind, the artificial intelligence principle of impartiality is applicable: “when the information is insufficient, the more hypotheses have been generated and tested, the more correct is the result, and no preference, therefore, might be given to any hypothesis before its testing”. In this paper, each hypothesis is the assumption that a statistical, conformational, or physico-chemical feature calculated by formulae (2) or (3) is significant for the activity of the examined site.
For this reason, in the analysis of statistical features, we test one by one all the possible variants of the oligonucleotide Z, varying (a) its length m from 1 to M; (b) its nucleotide composition over the 15 single-letter codes; and (c) all available position-effect functions w(i). Thus, the weighted concentration XZ,m,w(Sn) is calculated by formula (2) for each fixed combination “Z, m, w” and each sequence Sn with known activity Fn. Hence, the total number of statistical features generated and tested is about 10^7. Similarly, all the available conformational and physico-chemical properties Pq and all the possible regions (a, b) within the examined site are considered one by one. In this way, for a fixed “q, a, b”, the conformational or physico-chemical feature Xq,a,b(Sn) is calculated by formula (3) for each sequence Sn with known activity Fn. The total number of such features is about 10^5.
When so many statistical, conformational, or physico-chemical features are generated and tested for significance for a given site activity, the problem of an insignificant feature being chosen by chance becomes crucial. In this paper, we suggest solving this problem within the framework of decision making theory [28] and Zadeh’s fuzzy logic [29] in the following way.
Let us calculate by formula (2) a fixed statistical feature Xzmw(Sn) for each sequence Sn with known activity Fn. If the resulting pairs {Xzmw(Sn), Fn} meet all the conditions of applicability of the linear regression (formula 1), then the activity F of an arbitrary sequence S is predictable from the feature Xzmw(S). To test the conditions, a simple regression is optimized for the pairs {Xzmw(Sn), Fn}:
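The orders of magnitude quoted above can be checked with a short count. The sketch below is an illustration under assumptions: the maximum oligonucleotide length M = 4 and the site length L = 71 are not stated in the text and are chosen here only because they reproduce the quoted totals.

```c
/* Number of statistical features: all oligonucleotides of length 1..M over
   `codes` single-letter codes, times the number of weight functions. */
unsigned long long count_statistical(unsigned M, unsigned codes,
                                     unsigned weight_functions)
{
    unsigned long long total = 0, power = 1;
    for (unsigned m = 1; m <= M; m++) {
        power *= codes;      /* codes^m patterns of length m */
        total += power;
    }
    return total * weight_functions;
}

/* Number of regional features: `properties` times the number of regions
   (a, b) with a < b inside a site of length L. */
unsigned long long count_regional(unsigned properties, unsigned L)
{
    return (unsigned long long)properties * L * (L - 1) / 2;
}
```

With 15 codes, 180 weight functions, and an assumed M = 4, count_statistical gives (15 + 225 + 3375 + 50625) × 180 = 9,763,200, close to 10^7. With 40 properties and an assumed L = 71, count_regional gives 99,400, close to 10^5.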
$$F_{Z,m,w}(S_n) = f_0 + f_1 \, X_{Z,m,w}(S_n), \qquad (4)$$
where: f0 and f1 are the regression coefficients optimized for the pairs {Xzmw(Sn), Fn} [30].
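A simple regression of this kind can be fitted by ordinary least squares. The sketch below assumes that estimator; the text cites [30] for the optimization without giving details, and the function name is hypothetical.

```c
#include <stddef.h>

/* Least-squares fit of F = f0 + f1 * X over n pairs (x[i], f[i]).
   Returns 0 on success, -1 if n < 2 or all x values are equal. */
int fit_simple_regression(const double *x, const double *f, size_t n,
                          double *f0, double *f1)
{
    double mx = 0.0, mf = 0.0, sxx = 0.0, sxf = 0.0;
    if (n < 2) return -1;
    for (size_t i = 0; i < n; i++) {
        mx += x[i];
        mf += f[i];
    }
    mx /= (double)n;
    mf /= (double)n;
    for (size_t i = 0; i < n; i++) {
        sxx += (x[i] - mx) * (x[i] - mx);
        sxf += (x[i] - mx) * (f[i] - mf);
    }
    if (sxx == 0.0) return -1;   /* feature is constant: slope undefined */
    *f1 = sxf / sxx;             /* slope */
    *f0 = mf - *f1 * mx;         /* intercept */
    return 0;
}
```

For the pairs (0, 1), (1, 3), (2, 5) the fit returns f0 = 1 and f1 = 2.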
To ensure the reliability of the regression between the Xzmw(Sn) and Fn values, 22 conditions of regression analysis are tested, namely, the presence of linear, sign, and rank correlations between the predicted Fzmw(Sn) and experimental Fn activities; the equality of distributions of these values; the Gaussian distribution of their deviations (Fzmw(Sn)-Fn); and so on. When testing each of the 22 conditions, the significance level αr at which the r-th condition is met is estimated (where 1 ≤ r ≤ 22). Within the framework of Zadeh’s fuzzy logic [29], each estimate αr is transformed onto a uniform scale, the so-called “partial utility of using the feature Xzmw to predict the activity F”, as follows:
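$$u_r = \begin{cases} 1, & \alpha_r < 0.01; \\ u(\alpha_r) \in [-1, 1], & 0.01 \le \alpha_r \le 0.1; \\ -1, & \alpha_r > 0.1. \end{cases} \qquad (5)$$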
The highest partial utility ur=1 is assigned to the feature Xzmw if the r-th condition is met at the significance level αr < 0.01. The utility is lowest, ur=-1, if the r-th condition is not met (αr > 0.1). An intermediate partial utility ur ∈ [-1, 1] is assigned to a feature Xzmw that meets the r-th condition at an intermediate αr ∈ [0.01, 0.1]. Within the framework of decision making theory [28], averaging all the 22 partial utilities gives the integral utility of using the feature Xzmw to predict the activity F:
$$U(X_{Z,m,w}, F) = \frac{1}{22} \sum_{r=1}^{22} u_r. \qquad (6)$$
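The utility scale can be sketched as follows. The end points (1 below 0.01, -1 above 0.1) and the averaging follow the text; the linear ramp between 0.01 and 0.1 is an assumption of this example, since the exact intermediate mapping is not recoverable from the text. The function names are hypothetical.

```c
#include <stddef.h>

/* Partial utility for a significance level alpha.
   End points follow the text; the linear ramp on [0.01, 0.1] is assumed. */
double partial_utility(double alpha)
{
    if (alpha < 0.01) return 1.0;
    if (alpha > 0.1)  return -1.0;
    return 1.0 - 2.0 * (alpha - 0.01) / (0.1 - 0.01);
}

/* Integral utility: the mean of the partial utilities of all n conditions
   (n = 22 in the text). Returns 0 for n = 0. */
double integral_utility(const double *alpha, size_t n)
{
    double sum = 0.0;
    if (n == 0) return 0.0;
    for (size_t r = 0; r < n; r++)
        sum += partial_utility(alpha[r]);
    return sum / (double)n;
}
```

Under this assumed ramp, a feature that meets three of four conditions at α = 0.001 and fails the fourth at α = 0.5 has an integral utility of (1 + 1 + 1 - 1) / 4 = 0.5.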
Only the linearly independent features XZ,m,w with the highest positive utility are selected:
$$U(X_{Z,m,w}, F) = \max_{Z',m',w'} U(X_{Z',m',w'}, F) > 0. \qquad (7)$$
To have a positive utility U(XZ,m,w, F), the statistical feature XZ,m,w needs to meet at least half of the 22 conditions of linear regression applicability. Thus, the probability that a feature X with positive utility U(X, F) is selected by chance from 10^7 features can be estimated with the binomial distribution:
. (8)
Formula (8) shows that each statistical feature XZ,m,w selected by formula (7) significantly meets the conditions of linear regression applicability for predicting the site activity. The same holds for the conformational and physico-chemical features Xq,a,b. That is why the selection can proceed by generating and testing the features one by one.
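The binomial estimate mentioned above can be explored numerically. The sketch below computes only a generic binomial tail; treating the 22 conditions as independent trials with a common chance probability p of being met is an assumption of this example, and the function name is hypothetical.

```c
/* P(at least k successes in n independent trials, each with probability p). */
double binomial_tail(unsigned n, unsigned k, double p)
{
    double q = 1.0 - p, tail = 0.0;
    if (k > n) return 0.0;
    for (unsigned j = k; j <= n; j++) {
        /* term = C(n, j) * p^j * q^(n-j), accumulated factor by factor */
        double term = 1.0;
        for (unsigned i = 1; i <= j; i++)
            term *= (double)(n - j + i) / (double)i * p;
        for (unsigned i = 0; i < n - j; i++)
            term *= q;
        tail += term;
    }
    return tail;
}
```

A call such as binomial_tail(22, 11, p) gives the chance that a random feature meets at least half of the 22 conditions under these assumptions; multiplying the result by the number of tested features (about 10^7) estimates how many features would pass by chance alone.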
ALGORITHM
The simple combinatorial algorithm used is shown schematically in Fig.3. The algorithm works as follows: all the possible features X(Sn) are calculated by formulae (2) and (3) for all the available site sequences Sn with known activities Fn, and all the corresponding utilities U(X, F) are estimated by formulae (4), (5) and (6). When all U(X, F) ≤ 0, the algorithm terminates without selecting any feature; otherwise, all the linearly independent features {Xk}, 1 ≤ k ≤ K, with the highest positive utilities U(Xk, F) > 0 are selected. Based on the selected features {Xk}, the linear-additive approximation (formula 1) predicting the site activity is derived and, finally, the C code implementing this prediction is generated and stored [25].
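The selection step can be sketched as a greedy loop over features in order of decreasing utility. This is an illustration, not the procedure of Fig.3: the text does not say how linear independence is tested, so the sketch substitutes a threshold on the squared correlation between feature columns, and all names and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Squared linear correlation between two columns of length n (0 if undefined). */
static double r_squared(const double *x, const double *y, size_t n)
{
    double mx = 0.0, my = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    if (n == 0) return 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;
    for (size_t i = 0; i < n; i++) {
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
        sxy += (x[i] - mx) * (y[i] - my);
    }
    if (sxx == 0.0 || syy == 0.0) return 0.0;
    return sxy * sxy / (sxx * syy);
}

/* Picks features with positive utility, best first, skipping any feature whose
   squared correlation with an already selected one exceeds max_r2 (a stand-in
   for the linear-independence test). Writes indices to selected[] and returns
   how many were chosen; 0 means no feature had positive utility. */
size_t select_features(const double *const *feature, const double *utility,
                       size_t count, size_t n, double max_r2,
                       size_t *selected, size_t max_selected)
{
    size_t K = 0;
    char *done = calloc(count ? count : 1, 1);
    if (!done) return 0;
    while (K < max_selected) {
        size_t best = count;
        for (size_t j = 0; j < count; j++)
            if (!done[j] && utility[j] > 0.0 &&
                (best == count || utility[j] > utility[best]))
                best = j;
        if (best == count) break;        /* no positive-utility candidate left */
        done[best] = 1;
        int independent = 1;
        for (size_t k = 0; k < K && independent; k++)
            if (r_squared(feature[best], feature[selected[k]], n) > max_r2)
                independent = 0;
        if (independent) selected[K++] = best;
    }
    free(done);
    return K;
}
```

The selected indices can then be passed to a multiple regression to obtain the coefficients Fk of formula (1).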
This algorithm has been implemented with Borland C compiler on IBM PC platform to develop the distributed and intelligent database ACTIVITY presented schematically in Fig.4. It contains the following: three databases, computer system generating programs to predict the site activity and the library for the computer programs predicting activities of the sites. ACTIVITY is WWW-available in “real-time” mode through URL “http://www.bionet.nsc.ru/SRCG/Activity/”.
RESULTS AND DISCUSSION
The most important unit of the ACTIVITY is the database for DNA and RNA site activity. Currently, it describes more than 70 samples exemplified in Table 2. Among them are promoters and binding sites for different E. coli regulatory proteins, TATA-boxes and binding sites for various eukaryotic transcription factors, translation starts, splicing and 3’processing sites, mutation hotspots and many others. The parameters characterizing specific site activities include the association and dissociation rate, affinity, lifetime of the complexes, product concentrations controlled by these sites, transcription and translation efficiencies, mutation and cutting frequencies, etc. Fig.5 gives the database format by using as example the E. coli promoter strength in terms of “-log[Pbla]” units [23].
ACTIVITY has also the database for conformational and physico-chemical properties of DNA. The current version of the database contains about 40 properties, some of them are listed in Table 3. As an example, Fig.6 gives the presentation of the B-helical angle “Direction” in the database.
These two databases are the input for the computer system generating programs that predict the site activity. For the initial data on the E. coli promoter strength (Fig.5), the ACTIVITY output is shown in Fig.7. This output is stored in the knowledge base of the features significant for predicting the site activity (see scheme in Fig.4). Fig.8 illustrates what this knowledge means. The concentration of the trinucleotide ASM weighted by the function w(i) given in Fig.2a correlates significantly with the promoter strength (Fig.8a). The function w(i) assigns the highest weights to the region (-1; 11). This means that the trinucleotide ASM near the transcription start makes the highest contribution to the promoter strength. The Direction averaged over the region (-5; 15) also correlates with the promoter strength (Fig.8b). Based on these two features, the linear-additive approximation (1) for predicting the strength is derived (Table 4). Fig.8c compares the experimental and predicted E. coli promoter strengths. The linear correlation coefficient r=0.91 indicates significant agreement between experiment and prediction.
Also, Table 4 presents some dozens of eukaryotic promoter sites analyzed by Activity to demonstrate the universality of the linear-additive approximation (1) in this field of intense research. For all these sites, the significant statistical, physico-chemical, or conformational features have been identified and the linear-additive approximations predicting the site activities have been derived.
Analysis of the mouse αA-crystallin gene promoter (Fig.9a) showed that the best physico-chemical feature of its PE1B region near the TATA box is the probability of contacting the nucleosome core. This feature correlates negatively with the transcription activity, which means that the tighter the interaction of the promoter with nucleosomes, the lower the transcription activity. This result is consistent with the experimental data showing that nucleosome displacement from a promoter precedes the TBP/TATA binding [46, 47]. The analysis has also demonstrated that such conformational features as the major groove dist (Fig.9b) and the Tilt (Fig.9c) are of great importance for the transcription activity. Using these three features, the linear-additive approximation predicting the transcription activity was derived (Table 4) and tested (Fig.9d).
Analysis of the sequences with known DNA bending in the TBP/TATA complex demonstrated that the bending increases with the inclination (Fig.10a). Similar results were obtained by the X-ray analysis of the TBP/TATA complexes. DNA bending in these complexes was shown to result from intercalation of four phenylalanine residues of the TBP between a pair of adjacent bases on the side of the minor groove [48]. Inclination describes the rotation angle of a pair of bases along the short axis of the pair, and the increase in the angle widens the minor groove [49], thereby facilitating the intercalation of phenylalanines on the minor groove and DNA bending.
The linear-additive approximation (1) can also be applied to synthetic analogues of sites and their mutational variants. A study of the synthetic analogues of the TATA boxes with known TBP affinity revealed two significant features: (i) the weighted concentration of the dinucleotide TV, contributing primarily to the TBP affinity in the center of the site (Fig.2b); and (ii) the weighted concentration of the dinucleotide WR, contributing chiefly to the affinity at the site termini (Fig.2c). The linear-additive approximation predicting the TBP/DNA affinity was derived using these two statistical features (Table 4). The agreement between the experimental and predicted affinities is shown in Fig.10b.
Fig.11 demonstrates that, for an arbitrary site, conformational (a, b), physico-chemical (c), and statistical (d) features of DNA and RNA can all prove significant for predicting the site activity.
Summing up, we would like to underline that the linear-additive approximation (formula 1) derived for predicting site activities can be helpful in a wide range of investigations in molecular biology. Importantly, ACTIVITY does not require a huge body of initial experimental data. It is completely automated and available on the WWW (http://www.bionet.nsc.ru/SRCG/Activity/).
This work was supported by grants from the Russian Foundation for Basic Research.
REFERENCES
1. P. Bucher, J. Mol. Biol., 212, 563 (1990)
2. I. Ioshikhes and E.N. Trifonov, Nucleic Acids Res. 21, 4857 (1993)
3. J.D. Helmann, Nucleic Acids Res. 23, 2351 (1995)
4. E. Wingender, A.E. Kel, et al., Nucleic Acids Res., 25, 265 (1997)
5. M.S. Gelfand, J. Comput. Biol., 2, 87 (1995)
6. S. Karlin and V. Brendel, Science, 257, 39 (1992)
7. E.C. Uberbacher, Y. Xu, and R.J. Mural, Methods Enzymol., 266, 259 (1996)
8. Q.K. Chen, G.Z. Hertz, and G.D. Stormo, CABIOS, 13, 29 (1997)
9. J.W. Fickett, Trends Genet., 12, 316 (1996)
10. R. Guigo and J.W. Fickett, J. Mol. Biol., 253, 51 (1995).
11. V.V. Solovyev, A.A. Salamov, and C.B. Lawrence, Nucleic Acids Res., 22, 5156 (1994).
12. E.E. Snyder and G.D. Stormo, J. Mol. Biol., 248, 1 (1995).
13. A.V. Mazin and S.C. Kowalczykowski, Proc. Natl. Acad. Sci. U.S.A., 93, 10673 (1996)
14. J.G. Kim, Y. Takeda, B.W. Matthews, and W.F. Anderson, J. Mol. Biol., 196, 149 (1987)
15. C. Coulondre, J.H. Miller, P.J. Farabaugh, and W. Gilbert, Nature, 274, 775 (1978)
16. A. Gil and N.J. Proudfoot, Cell, 49, 399 (1987)
17. A.A. Sokolenko, I.I. Sadomirsky, and L.K. Savinkova, Mol. Biol. (Msk), 30, 279 (1996).
18. M.E. Mulligan, D.K. Hawley, et al., Nucleic Acids Res., 12, 789 (1984)
19. G.D. Stormo, T.D. Schneider, and L. Gold, Nucleic Acids Res., 14, 6661 (1986).
20. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 193, 723 (1987)
21. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 200, 709 (1988)
22. D. Barrick, K. Villanueba, et al., Nucleic Acids Res., 22, 1287 (1994)
23. J. Jonsson, T. Norberg, et al., Nucleic Acids Res., 21, 733 (1993)
24. R.J. Kraus, E.E. Murray, et al., Nucleic Acids Res., 24, 1531 (1996)
25. M.P. Ponomarenko, A.N. Kolchanova, and N.A. Kolchanov, J. Comput. Biol., 4, 83 (1997)
26. M.P. Ponomarenko, L.K. Savinkova, et al., Mol. Biol (Msk), 31, 726 (1997)
27. T. Etzold and P. Argos, CABIOS, 9, 49 (1993)
28. P.C. Fishburn, Utility theory for decision making, New York, John Wiley & Sons (1970).
29. L.A. Zadeh, Information and Control, 8, 338 (1965)
30. E. Förster and B. Rönz, Methoden der Korrelations- und Regressionsanalyse, Berlin, Verlag Die Wirtschaft (1979)
31. M.R. Gartenberg and D.M. Crothers, Nature, 333, 824 (1988)
32. L.W. Chiang and M.M. Howe, Genetics, 135, 619 (1993)
33. D.B. Starr, B.C. Hoopes, and D.K. Hawley, J. Mol. Biol., 250, 434 (1995)
34. D. Boyd et al., J. Mol. Biol., 253, 677 (1995)
35. A.J. Bendall and P.L. Molloy, Nucleic Acids Res., 22, 2801 (1994)
36. C.M. Sax, A. Cvekl, et al., Nucleic Acids Res., 23, 442 (1995)
37. A. Kretsovali, and J. Papamatheakis, Nucleic Acids Res., 23, 2919 (1995)
38. M. McDevitt et al., EMBO J., 5, 2907 (1986)
39. C.F. Lesser and C. Guthrie, Genetics, 131, 851 (1993)
40. H. Karas, R. Knuppel, W. Schulz, H. Sklenar, and E. Wingender, CABIOS, 12, 441 (1996)
41. A.A. Gorin, V.B. Zhurkin, and W.K. Olson, J. Mol. Biol., 247, 34 (1995)
42. E.S. Shpigelman, E.N. Trifonov, and A. Bolshoy, CABIOS, 9, 435 (1993)
43. M. Suzuki, N. Yagi, and J.T. Finch, FEBS Lett., 397, 148 (1996)
44. M.E. Hogan and R.H. Austin, Nature, 329, 263 (1987)
45. N. Sugimoto, S. Nakano, M. Yoneyama, and K. Honda, Nucleic Acids Res., 24, 4501 (1996)
46. D.G. Edmondson and S.Y. Roth, FASEB J., 10, 1173 (1996)
47. J.S. Godde, Y. Nakatani, and A.P. Wolffe, Nucleic Acids Res., 23, 4557 (1995)
48. Z.S. Juo, T.K. Chiu, et al., J. Mol. Biol. 261, 239 (1996).
49. EMBO Workshop, EMBO J., 8, 1 (1989)
Table 1. The sites of the same type can differ in activity by several orders of magnitude
Site name | Site sequence | Activity
DNA/RecA-filament affinity in E. coli [13] | CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC | 350
 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | 1
Cro/DNA-affinity in E. coli [14] | TAATGTGAGTTAGCTCACTCAT | 91
 | TAATGTAAGTTAGCTCACTCAT | 1
2-aminopurine induced mutations C->T [15] | CGCGTGGTGAACCAGGCCAGCCACG | 51
 | ACCACCATCAAACAGGATTTTCGCC | 1
3’processing pre-mRNA in SV40 [16] | UUCUACCGGAUCGUUGUGUUCGAGG | 13
 | UUCUACCGGAUCGUUUUGGUCGAGG | 1
TBP/TATA-affinity in yeast [17] | GGGGCTATAAAAGGGGGTGG | 7
 | GTACCTATGGGTCTGCTGGT | 1
Table 2. Examples of the sites with known activities available in the database ACTIVITY
Site name | Sequences | Activity parameter | Scale | Min | Max | Ref.
Cro-binding site | Natural | Association rate constant | ln | 19.1 | 19.9 | [14]
CRP-binding site | Natural | Affinity CRP/DNA | ln | -3.2 | 3.2 | [31]
E. coli promoter | Mutant | Promoter strength | -log | 0.26 | 2.1 | [23]
C-protein binding site | Mutant | Transcription activity | ln | -6.2 | 1.8 | [32]
TATA box | Mutant | TBP/DNA lifetime | minute | 1 | 185 | [33]
TATA box | Mutant | Bend, DNA/TBP complex | degree | 33 | 106 | [33]
Transcription signal INR | Mutant | Affinity INR/DNA | ln | -4.6 | 1.3 | [24]
Transcription signal OCT-1 | Mutant | Transcription activity | ln | -2.3 | 0.63 | [34]
Transcription signal USF | Synthetic | Affinity USF/DNA | ln | 3.8 | 100 | [35]
PE1B box (adjacent TATA box) | Mutant | Transcription activity | ln | -1.4 | 1.4 | [36]
Transcription signal IL-1 | Mutant | Transcription activity | ln | -1.9 | 4.1 | [37]
Pre-mRNA 3’cleavage site | Mutant | Cleavage efficiency | % | 3 | 289 | [38]
Pre-mRNA donor splice site | Mutant | Cleavage efficiency | % | 18 | 100 | [39]
E. coli ribosome-binding site | Synthetic | Translation activity | ln | 0.0 | 8.06 | [22]
2-aminopurine induced mutation | Natural | Mutation frequency | ln | 0.0 | 5.6 | [15]
Table 3. Examples of the DNA properties available in the database ACTIVITY
Property name | Unit | Min | Max | Ref.
Conformational:
Twist | Degree | 31.1 | 41.4 | [40]
Propeller | Degree | -17.3 | -6.7 | [41]
Tip | Degree | -1.64 | 6.7 | [40]
Inclination | Degree | -1.43 | 1.43 | [40]
Tilt | Degree | -2.6 | 0.6 | [41]
Bend | Degree | 2.16 | 6.74 | [40]
Wedge | Degree | 1.1 | 8.4 | [42]
Direction | Degree | -154 | 180 | [42]
Roll | Degree | -6.2 | 6.2 | [43]
Rise | Angstrom | 3.16 | 4.08 | [40]
Slide | Angstrom | -0.4 | 1.6 | [43]
Minor groove width (width) | Angstrom | 4.62 | 6.40 | [40]
Minor groove depth (depth) | Angstrom | 8.79 | 9.11 | [40]
Minor groove width size (size) | Angstrom | 2.7 | 4.7 | [41]
Minor groove width distance (dist) | Angstrom | 2.79 | 4.24 | [41]
Major groove width (WIDTH) | Angstrom | 12.1 | 15.5 | [40]
Major groove depth (DEPTH) | Angstrom | 8.45 | 9.60 | [40]
Major groove size (SIZE) | Angstrom | 3.26 | 4.70 | [41]
Major groove distance (DIST) | Angstrom | 3.02 | 3.81 | [41]
Physico-chemical:
Clash strength | f | 0.00 | 2.53 | [41]
Bending mobility to minor groove | m | 1.02 | 1.27 | [31]
Bending mobility to major groove | m | 0.99 | 1.18 | [31]
Persistent length | nl | 20 | 130 | [44]
Melting temperature | °C | 36.7 | 136.1 | [44]
Probability to be contacting nucleosome core | % | 1 | 18 | [44]
Enthalpy change | kcal/mol | -11.8 | -5.6 | [45]
Entropy change | cal/mol/K | -28.4 | -15.2 | [45]
Free energy change | kcal/mol | -2.8 | -0.9 | [45]
Table 4. Examples of the functional DNA and RNA sites analyzed by the system ACTIVITY
Site name | Position “1” | n | Activity, F | Xk | Region | Property | U | r | α
E. coli promoters [23] | Transcription start | 27 | Strength | X1 | Fig.2a | [ASM] | 0.59 | 0.86 | 10^-2
 | | | | X2 | -5; 15 | Direction | 0.50 | 0.71 | 10^-2
 | | | | F=0.3+0.6×X1+0.0008×X2 | | | | 0.91 | 10^-4
PE1B region adjacent to the TATA box of the αA-crystallin promoter [36] | Transcription start | 10 | Transcription activity | X1 | -32; -25 | Pnucl | 0.36 | -0.77 | 10^-2
 | | | | X2 | -29; -19 | DIST | 0.41 | 0.86 | 10^-3
 | | | | X3 | -31; -25 | Tilt | 0.38 | -0.78 | 10^-2
 | | | | F=-39-0.1×X1+12×X2-X3 | | | | 0.90 | 10^-4
TATA boxes (synthetic) [17] | Synthetic DNA start | 19 | Affinity to yeast TBP | X1 | Fig.2b | [TV] | 0.35 | 0.73 | 10^-2
 | | | | X2 | Fig.2c | [WR] | 0.41 | 0.76 | 10^-2
 | | | | F=14.5+2.5×X1+0.9×X2 | | | | 0.77 | 10^-2
TATA boxes (mutant) [33] | TATA box start | 9 | DNA bending TBP/TATA | X1 | 0; 9 | Inclination | 0.19 | 0.76 | 0.05
 | | | | F=120.15+70.32×X1 | | | | 0.76 | 0.05
USF binding site (synthetic) [35] | Synthetic DNA start | 14 | Affinity USF/DNA | X1 | 11; 15 | depth | 0.22 | -0.78 | 10^-3
 | | | | X2 | 11; 20 | Twist | 0.23 | -0.86 | 10^-4
 | | | | F=170-16.3×X1-0.7×X2 | | | | 0.91 | 10^-5
CRP-binding site [31] | Center of the consensus repeat | 10 | Affinity CRP/DNA | X1 | -15; 14 | Rise | 0.15 | -0.86 | 10^-2
 | | | | X2 | -17; 12 | width | 0.06 | 0.78 | 10^-2
 | | | | F=190-66.8×X1+7.5×X2 | | | | 0.87 | 10^-2
2-aminopurine induced mutations C->T [15] | Mutation point | 26 | Mutation frequency | X1 | -1; 2 | Tmelt | 0.20 | 0.90 | 10^-5
 | | | | F=-8.5568+0.1585×X1 | | | | 0.90 | 10^-5
ssDNA/RecA-filament (synthetic) [13] | Synthetic DNA start | 15 | DNA/RecA affinity | X1 | Fig.2d | [DRV] | 0.27 | -0.89 | 10^-5
 | | | | F=0.54-1.03×X1 | | | | 0.89 | 10^-5
The SV40 pre-mRNA 3’processing site [16] | RNA cutting point | 16 | Cutting frequency | X1 | Fig.2a | [VUKK] | 0.24 | 0.76 | 10^-4
 | | | | F=-301.72+216.16×X1 | | | | 0.76 | 10^-4
Cro-binding site [14] | Consensus start | 7 | Affinity Cro/DNA | X1 | 1; 16 | width | 0.55 | 0.97 | 10^-3
 | | | | X2 | 6; 19 | Roll | 0.44 | 0.90 | 10^-3
 | | | | X3 | 6; 19 | Rise | 0.41 | 0.92 | 10^-2
 | | | | F=-72+4×X1+X2+13×X3 | | | | 0.99 | 10^-5
Notes: n, the total number of the site variants; Xk, the selected context-dependent feature; U, utility value; r, linear correlation coefficient; α, significance of the linear correlation coefficient value; [Z], the concentration of the oligonucleotide Z weighted with the weight function w(i) given in Fig.2 (formula 2); Pnucl, probability to be contacting with nucleosome core; Tmelt, melting temperature; depth, minor groove depth; width, minor groove width; WIDTH, major groove width; DIST, major groove dist; F=F0+Σ(k=1..K) Fk×Xk, linear-additive approximation (formula 1) predicting the site activity.
FIGURE LEGENDS
Fig. 1. The tetranucleotide VUKK concentration is responsible for the SV40 pre-mRNA 3’processing efficiency (a); and the major groove width is responsible for the Cro/DNA affinity (b).
Fig. 2. Examples of the weight functions w(i) modeling the highest effect of oligonucleotides located within the site 3’-half (a), central part (b), termini (c) and near 5’-terminus (d) on site activity.
Fig. 3. Algorithm for generating the C-code programs to predict site activities (where the indexes “φ γ λ” are either the indexes “Z,m,w” or the indexes “q,a,b” in formulae (2) or (3), respectively).
Fig. 4. A scheme of the distributed and intelligent database ACTIVITY.
Fig. 5. The description of experimental data on the E.coli promoter strength [23] within ACTIVITY: MI, entity identifier; MN, sample name; OG, genome region; OS, species; FF, site; AN, activity name; AU, activity unit; SC, variant; SQ, sequence; SA, activity value.
Fig. 6. The description of the conformational property “Direction” [42] within ACTIVITY: MI, entity identifier; MN, property type; MD, molecule; ML, step; PN, property name; PM, identifying method; PU, property unit; DINUCLEOTIDE, property values.
Fig. 7. The ACTIVITY result of the E.coli promoter strength. Fields: MI, entity identifier; MN, sample name; CF, feature type; PV, property/oligonucleotide; AB, region; UT, utility; LC, linear correlation coefficient; C-CODE of the computer program calculating the feature.
Fig. 8. An interpretation of the ACTIVITY result of the E.coli promoter strength: (a) the trinucleotide ASM concentration correlates with the promoter strength; (b) the Direction correlates with the strength (r=0.71); (c) the agreement between the experiment and prediction.
Fig. 9. The ACTIVITY result of the transcription activity of the mouse αA-crystallin gene promoter containing the PE1B region near TATA-box [36]: (a) the probability to be contacting with nucleosome core correlates negatively with the transcription activity; (b) the major groove dist correlates positively with the transcription activity; (c) the tilt correlates negatively with the transcription activity; (d) the agreement between the experimental and prediction data (Table 4).
Fig. 10. The ACTIVITY result of the TATA boxes: (a) the agreement between the experimental [17] and predicted TBP/DNA affinity; (b) the DNA bend within the TBP/TATA complex [33] correlates with the inclination.
Fig. 11. Examples of ACTIVITY-results: (a) the USF/DNA affinity [35] correlates with the twist; (b) the CRP/DNA affinity [31] correlates with the rise; (c) the frequency of the mutation induced by 2-aminopurine [15] correlates with the DNA melting temperature; (d) the ssDNA/RecA-filament affinity [13] correlates with the trinucleotide DRV concentration weighted by the weight function w(i) given in Fig.2d.
MI K0000001
MN E. coli promoter strength in terms of -log[Pbla] units
CF Statistical FEATURE
PV ASM
AB -49 18
UT 0.589
LC 0.860
C-CODE
#include <string.h>
/* Weighted concentration of the trinucleotide ASM: A, then S (G or C), then M (A or C). */
double WeightASM_for_EcPbla (char *s){
double X; char *seq; int i, SiteLength=68;
double Weight5P0 [66]={
/* -49 -48 -47 -46 -45 -44 -43 */
0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100,
......................................................
/* 11 12 13 14 15 16 */
0.525, 0.356, 0.207, 0.143, 0.103, 0.100 };
seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
/* a trinucleotide can start at SiteLength-2 = 66 positions, one per weight */
for (i=0, X=0.;i<SiteLength-2;i++) {
if(seq[i ]=='A')
if(seq[i+1]=='G' || seq[i+1]=='C')
if(seq[i+2]=='A' || seq[i+2]=='C') X+=Weight5P0[i]; }
return(X);}
//
CF Conformational FEATURE
PV Direction
AB -5 15
UT 0.502
LC 0.710
C-CODE
double Direction_for_EcPbla (char *s){
double X; char *seq; int i,k, SiteLength=21;
double DinucPar[16]={
/* AA AT AG AC TA TT TG TC */
-154., 0., 2., 143., 0., 154., 64.,-120.,
/* GA GT GG GC CA CT CG CC */
120.,-143., 57., 180., -64., -2., 0., -57. };
seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
for (i=0, X=0.;i<SiteLength-1;i++) {
switch (seq[i ]) { case 'A': k= 0; break;
....................................................
default : return(-1002.); }
switch (seq[i+1]) { case 'A': k+=0; break;
....................................................
default : return(-1003.); }
if (k > 15) return(-1004.); X+=DinucPar[k]; }
return (X/(double)(SiteLength-1));};
//
CF PREDICTION ACTIVITY
LC 0.910
C-CODE
double EcPbla_by_WeightASM_Direction (char *s){
extern double WeightASM_for_EcPbla (char *);
extern double Direction_for_EcPbla (char *);
double x1,x2; char *seq; int s1=0, s2=45, SiteLength=68;
seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
seq=&s[s1]; x1=WeightASM_for_EcPbla (seq); if(x1<-999.)return(x1);
seq=&s[s2]; x2=Direction_for_EcPbla (seq); if(x2<-999.)return(x2);
return (0.307547 + 0.576596*x1 + 0.000799*x2);}
//