THE FUNCTIONAL SITES OF PRO- AND EUKARYOTIC GENOMES:

LINEAR-ADDITIVE APPROXIMATION FOR PREDICTING ACTIVITY OF THE FUNCTIONAL DNA AND RNA SITES

N.A. Kolchanov, M.P. Ponomarenko, J.V. Ponomarenko, A.S. Frolov, N.L. Podkolodny^#

Institute of Cytology and Genetics, Novosibirsk, Russia, 630090; kol@bionet.nsc.ru; ^#)Institute of Computational Mathematics and Mathematical Geophysics, Novosibirsk, Russia, 630090

ABSTRACT

We propose a linear-additive approximation for predicting the activity of functional DNA and RNA sites. Its novelties are (i) taking into account the physico-chemical and conformational properties of DNA and RNA and (ii) generating and testing, one by one, a very large body of hypotheses of the form “Is a given DNA/RNA feature responsible for a given site activity?”. The features considered are calculated from conformational and physico-chemical properties and from the oligonucleotide content of a given site. Such features are easy to interpret but hard to guess in advance. That is why we propose to generate as many of these features as the computer can afford and to test each of them for significance. The linear-additive approximation has been implemented in ACTIVITY, a distributed and intelligent database of functional site activities. Currently, about 60 experiments are described in this database. In addition, the database compiles 40 conformational and physico-chemical properties. It also includes a knowledge base of the features found to be significant for predicting site activities, which is linked to a library of computer programs implementing these predictions. ACTIVITY is available on the WWW in real-time mode at http://www.bionet.nsc.ru/SRCG/Activity/.

INTRODUCTION

Every molecular process in a cell, such as replication, transcription, splicing, and translation, is controlled by a definite set of functional sites. Thousands of such sites are presently known. Each site has a certain location and activity value. Site recognition methods rely on the data on site locations stored in the EMBL Data Library, the GenBank database and other compilations [1-4]. Hundreds of methods exist in this area of intense research (for review, see [5]). Among them, consensus [6], neural network [7] and weight matrix [8] methods are widely used. Methods that recognize site locations are applied to annotate genomic DNA [9-12]. However, functional sites of the same type can differ in activity by several orders of magnitude. Table 1 shows that synthetic ssDNAs differ 350-fold in their affinity for the RecA filament [13], mutant E. coli operators vary within two orders of magnitude in their affinity for the Cro repressor [14], and so on [15-17].

Mulligan et al. [18] were the first to predict the activity K_B·k_2 of E. coli promoters from a homology score. Using multiple regression, Stormo et al. [19] optimized weight matrices to predict site activities for su2 suppression and 2-aminopurine-induced mutations. Berg and von Hippel [20, 21] generalized these data within the framework of a statistical-mechanical theory and applied it to predict the activities of the CRP- and Cro-binding sites and of E. coli promoters. For RNA, a weight matrix predicting the activity of the E. coli ribosome-binding site has also been optimized [22]. Jonsson et al. [23] introduced neural networks to predict E. coli promoter strength. Neural networks have also made it possible to predict the activities of the INR-binding site and the TATA box [24]. A system generating programs to predict site activities has been created [25] and applied to predict the consensus site maximizing the affinity between DNA and the TBP protein [26]. This consensus was found to be similar to Bucher’s consensus [1] for the TATA box.

The large body of data on site activity appears to be as informative for the development of prediction methods as the widely used data on site locations. Therefore, predicting site activity remains a challenging problem of computational biology. However, there are no databases of site activities and, thus, no information sources for predicting them, although hundreds of experimental “sequence → activity” data sets exist. Once the SRS query language [27] was introduced, compiling data on site activities became feasible.

With this background, we developed ACTIVITY, a distributed and intelligent database of the activities of functional sites in DNA and RNA. It comprises (i) a database of experimental data on site sequences with known activities, (ii) a database of the conformational and physico-chemical properties of DNA and RNA, (iii) a knowledge base of the features significant for predicting site activities, and (iv) a library of programs predicting site activities. ACTIVITY is available on the WWW in “real-time” mode (http://www.bionet.nsc.ru/SRCG/Activity/).

METHOD

In this paper, we propose a linear-additive approximation for predicting the activities of functional DNA and RNA sites. Its biological novelty is that it takes into account physico-chemical and conformational DNA properties in addition to the widely used contextual ones. Its computational novelty is the use of decision-making theory [28] and Zadeh’s fuzzy logic [29] to generate and test a very large body of hypotheses of the form “Is this DNA property responsible for a given site activity?”. Fig.1 demonstrates that (a) the concentration of the tetranucleotide VUKK around the SV40 pre-mRNA cleavage point is responsible for the 3’processing efficiency, and (b) the major groove width of DNA is responsible for the Cro/DNA affinity. DNA features of this type are easy to interpret but hard to guess in advance.

That is why we propose to generate and test as many DNA features as the computer can afford.

The core idea of the linear-additive approximation is that the site activity, F, is determined by the simultaneous action of site features, X, of two types: obligatory and facultative. The obligatory features of a given site are invariant across all sequences of this site and determine its basal activity. A consensus is a typical obligatory feature. The facultative features of a given site are individual in their “number, size and location” for each sequence of the site. They modulate the site activity relative to the basal level. Hence, within the framework of the linear-additive approximation, the activity of a site with sequence S can be calculated by the following equation:

$$F(S) = F_0(S) + \sum_{k=1}^{K} F_k \, X_k(S), \qquad (1)$$

where F₀(S) equals the basal activity of sites of this type when the sequence S has their obligatory features and equals 0 otherwise; X_k(S) are the values of the facultative features of the sequence S; and F_k are the contributions of the facultative features X_k to the site activity F.
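The linear-additive form can be illustrated with a minimal Python sketch. The function name `predict_activity`, its arguments, and the feature values below are hypothetical; only the coefficients 0.3, 0.6 and 0.0008 are taken from the E. coli promoter row of Table 4.

```python
from typing import Callable, Sequence

Feature = Callable[[str], float]


def predict_activity(
    seq: str,
    basal_activity: float,
    has_obligatory: Callable[[str], bool],
    features: Sequence[Feature],
    contributions: Sequence[float],
) -> float:
    """Linear-additive approximation: F(S) = F0(S) + sum_k F_k * X_k(S)."""
    if len(features) != len(contributions):
        raise ValueError("one contribution F_k is needed per feature X_k")
    # F0(S) is the basal activity if S carries the obligatory features, else 0.
    f0 = basal_activity if has_obligatory(seq) else 0.0
    # Facultative features modulate the activity around the basal level.
    return f0 + sum(fk * xk(seq) for fk, xk in zip(contributions, features))


# Hypothetical usage: coefficients from the E. coli promoter row of Table 4;
# the feature values X1 = 1.5 and X2 = 100.0 are invented placeholders.
example = predict_activity(
    "ACGT",
    basal_activity=0.3,
    has_obligatory=lambda s: True,
    features=[lambda s: 1.5, lambda s: 100.0],
    contributions=[0.6, 0.0008],
)
print(round(example, 2))
```

Passing the features as callables keeps the sketch independent of how each X_k is computed, so statistical and conformational features can be mixed in one sum.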

Three types of facultative features are considered: statistical, conformational, and physico-chemical. The weighted concentration of an oligonucleotide Z={z₁, ..., z_j, ..., z_m} of length m is used as a statistical feature of the nucleotide context of the sequence S={s₁, ..., s_i, ..., s_L} of length L:

$$X_{Z,m,w}(S) = \sum_{i=1}^{L-m+1} w(i)\, \delta(s_i \ldots s_{i+m-1};\, Z), \qquad (2)$$

Here, the so-called “δ-function” takes the value 1 or 0 at each position i of the sequence S, depending on the presence or absence of the oligonucleotide Z at this position:

$$\delta(s_i \ldots s_{i+m-1};\, Z) = \begin{cases} 1, & \text{if } s_{i+j-1} \in z_j \text{ for all } j = 1, \ldots, m; \\ 0, & \text{otherwise,} \end{cases}$$

where s_i ∈ {A, T, G, C}; z_j ∈ {A, T, G, C, W=A/T, R=A/G, M=A/C, K=T/G, Y=T/C, S=G/C, B=T/G/C, V=A/G/C, H=A/T/C, D=A/T/G, N=A/T/G/C}; m << L.
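A weighted oligonucleotide concentration of this kind might be computed as in the sketch below. It assumes an unnormalized weighted sum over all windows of length m and a weight function that takes a 0-based position; both are assumptions, as are the names `delta` and `weighted_concentration`. The code table repeats the 15-letter code listed above (for RNA, U would have to be added).

```python
from typing import Callable

# The 15-letter code from the text: each letter stands for a set of nucleotides.
CODES = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "R": "AG", "M": "AC", "K": "TG", "Y": "TC", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def delta(window: str, pattern: str) -> int:
    """Return 1 if every nucleotide of the window is allowed by the pattern."""
    if len(window) != len(pattern):
        return 0
    return int(all(s in CODES[z] for s, z in zip(window, pattern)))


def weighted_concentration(
    seq: str, pattern: str, weight: Callable[[int], float]
) -> float:
    """Sum the position weights w(i) over all positions where the pattern occurs."""
    m = len(pattern)
    return sum(
        weight(i) * delta(seq[i:i + m], pattern)
        for i in range(len(seq) - m + 1)
    )


# Hypothetical usage with a flat weight function w(i) = 1.
flat = lambda i: 1.0
print(weighted_concentration("TATAAAAG", "TV", flat))
print(weighted_concentration("TATAAAAG", "WR", flat))
```

With a flat weight the feature reduces to a plain occurrence count; a position-dependent w(i) lets the same oligonucleotide count more in some regions of the site than in others.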

The basic element of the facultative feature description is the position-effect function w(i). It accounts for the fact that the same oligonucleotide contributes differently to the site activity depending on its location. The function w(i) is determined by a simple rule: the more important a position is for the site function, the higher its assigned weight w(i). In total, 180 weight functions w(i) are used in activity prediction. The weight functions shown in Fig.2 model the highest effect on the site activity of (a) a narrow region within the right half of the site, (b) its central part, (c) its terminal regions, and (d) the left terminus of the site.

Local conformational heterogeneities of DNA, which depend on the nucleotide context, are known to play an important role in DNA-protein interactions, which essentially determine the site activity. That is why the prediction of site activity takes into account the conformational properties of DNA that describe the mutual orientation and location of base pairs. The values of these parameters averaged over the known X-ray structures are used. The following physico-chemical properties are also used: melting temperature, persistence length, entropy, and others. These properties determine the conformational dynamics of DNA sites during their functioning. About 40 conformational and physico-chemical properties are used in predicting site activities.

Thus, the sequence S of the site can be characterized by the mean value of the q-th conformational or physico-chemical property P_q, averaged over the region between positions a and b:

$$X_{q,a,b}(S) = \frac{1}{b-a} \sum_{i=a}^{b-1} P_q(s_i s_{i+1}). \qquad (3)$$
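The region average can be sketched as follows, assuming that each property is tabulated per dinucleotide step (as the DINUCLEOTIDE field in Fig.6 suggests) and that a and b are 0-based positions. The function name `mean_property` and the toy table, which simply counts G/C bases in each step, are hypothetical and stand in for a real property table.

```python
from typing import Dict


def mean_property(seq: str, table: Dict[str, float], a: int, b: int) -> float:
    """Average a dinucleotide-step property over the steps between a and b."""
    if not 0 <= a < b < len(seq):
        raise ValueError("need 0 <= a < b < len(seq)")
    # Steps s_a s_(a+1), ..., s_(b-1) s_b: (b - a) steps in total.
    steps = [seq[i:i + 2] for i in range(a, b)]
    return sum(table[step] for step in steps) / len(steps)


# Toy property table (NOT a real conformational property):
# the number of G/C bases in each dinucleotide step.
toy_table = {
    x + y: float((x in "GC") + (y in "GC"))
    for x in "ATGC"
    for y in "ATGC"
}

print(mean_property("ATGCGT", toy_table, 0, 5))
```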

It should be emphasized that before starting the analysis we knew very little about which statistical, physico-chemical, or conformational features were most important for the activity of the examined site. The only available data were the sequences with known activities. With this in mind, the artificial intelligence principle of impartiality applies: “when the information is insufficient, the more hypotheses have been generated and tested, the more correct is the result, and no preference, therefore, might be given to any hypothesis before its testing”. In this paper, each hypothesis is the assumption that a statistical, conformational, or physico-chemical feature calculated by formula (2) or (3) is significant for the activity of the examined site.

For this reason, in the analysis of statistical features, we test one by one all possible variants of the oligonucleotide Z, varying (a) its length m from 1 to M; (b) its nucleotide composition in the 15-letter single-letter code; and (c) all available position-effect functions w(i). Thus, the weighted concentration X_Z,m,w(S_n) is calculated by formula (2) for each fixed combination “Z, m, w” and each sequence S_n with known activity F_n. The total number of statistical features generated and tested is about 10⁷. Similarly, all available conformational and physico-chemical properties P_q and all possible regions (a, b) within the examined site are considered one by one. In this way, for each fixed “q, a, b”, the conformational or physico-chemical feature X_q,a,b(S_n) is calculated by formula (3) for each sequence S_n with known activity F_n. The total number of these features is about 10⁵.
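The stated orders of magnitude can be checked with a short count. The paper gives neither M nor the site length; the values M = 4 and a 71-position site used below are assumptions chosen only because they reproduce the figures of about 10⁷ and 10⁵. The function names are hypothetical.

```python
def count_statistical_features(
    max_len: int, n_codes: int = 15, n_weights: int = 180
) -> int:
    """Number of (Z, m, w) combinations for oligonucleotide lengths 1..max_len."""
    # n_codes ** m patterns of length m in the 15-letter code.
    patterns = sum(n_codes ** m for m in range(1, max_len + 1))
    return patterns * n_weights


def count_region_features(n_properties: int, site_len: int) -> int:
    """Number of (q, a, b) combinations with a < b inside a site of site_len positions."""
    regions = site_len * (site_len - 1) // 2
    return n_properties * regions


# ASSUMPTIONS: M = 4 and a 71-position site are not stated in the paper.
print(count_statistical_features(4))
print(count_region_features(40, 71))
```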

When so large a number of statistical, conformational or physico-chemical features is generated and tested for significance for a given site activity, the problem of selecting an insignificant feature by chance becomes crucial. In this paper, we propose to overcome this problem within the framework of decision-making theory [28] and Zadeh’s fuzzy logic [29] in the following way.

Let us calculate by formula (2) a fixed statistical feature X_zmw(S_n) for each sequence S_n with known activity F_n. If the resulting pairs {X_zmw(S_n), F_n} meet all the conditions for the applicability of linear regression (formula 1), then the activity F of an arbitrary sequence S is predictable from the feature X_zmw(S). To test these conditions, a simple regression is optimized for the pairs {X_zmw(S_n), F_n}:

$$F_{zmw}(S_n) = f_0 + f_1 \, X_{zmw}(S_n); \qquad (4)$$

where f₀ and f₁ are the regression coefficients optimized for the pairs {X_zmw(S_n), F_n} [30].
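A simple regression of this form can be fitted by ordinary least squares, as in the sketch below. Ordinary least squares is an assumption here, since the paper refers to [30] for the optimization; the function name and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def fit_simple_regression(
    xs: Sequence[float], ys: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares estimates (f0, f1) for the model F = f0 + f1 * X."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (X, F) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the feature is constant; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    f1 = sxy / sxx
    f0 = mean_y - f1 * mean_x
    return f0, f1


# Toy data lying exactly on the line F = 1 + 2 * X.
f0, f1 = fit_simple_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f0, f1)
```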

To ensure the reliability of the regression between the X_zmw(S_n) and F_n values, 22 conditions of regression analysis are tested, namely, the presence of linear, sign, and rank correlations between the predicted F_zmw(S_n) and experimental F_n activities; the equality of the distributions of these values; the Gaussian distribution of their deviations (F_zmw(S_n)-F_n); and so on. For each of the 22 conditions, the significance level α_r at which the r-th condition is met is estimated (1 ≤ r ≤ 22). Within the framework of Zadeh’s fuzzy logic [29], each estimate α_r is transformed to a uniform scale, the so-called “partial utility of using the feature X_Zmw to predict the activity F”, as follows:

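$$u_r = \begin{cases} 1, & \alpha_r < 0.01; \\ \text{an intermediate value in } [-1, 1], & 0.01 \le \alpha_r \le 0.1; \\ -1, & \alpha_r > 0.1. \end{cases} \qquad (5)$$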

The highest partial utility, u_r=1, is assigned to the feature X_zmw if the r-th condition is met at the significance level α_r<0.01. The utility is lowest, u_r=-1, if the r-th condition is not met (α_r>0.1). An intermediate partial utility u_r ∈ [-1, 1] is assigned to a feature X_zmw that meets the r-th condition at an intermediate level α_r ∈ [0.01, 0.1]. Within the framework of decision-making theory [28], averaging all 22 partial utilities gives the integral utility of using the feature X_Zmw to predict the activity F:

$$U(X_{Z,m,w}, F) = \frac{1}{22} \sum_{r=1}^{22} u_r. \qquad (6)$$
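The two utility scales might be sketched as follows. The text fixes only the end points (u_r=1 below 0.01, u_r=-1 above 0.1); the straight-line interpolation between them used here is an assumption, not the paper's transformation, and the function names are hypothetical.

```python
from typing import Sequence


def partial_utility(alpha: float) -> float:
    """Map the significance level of one condition onto the scale [-1, 1]."""
    if alpha < 0.01:
        return 1.0   # condition met at high significance
    if alpha > 0.1:
        return -1.0  # condition not met
    # ASSUMPTION: straight-line interpolation between (0.01, 1) and (0.1, -1);
    # the paper only says the value is intermediate.
    return 1.0 - 2.0 * (alpha - 0.01) / (0.1 - 0.01)


def integral_utility(alphas: Sequence[float]) -> float:
    """Average the partial utilities over all tested conditions."""
    if not alphas:
        raise ValueError("at least one condition is required")
    return sum(partial_utility(a) for a in alphas) / len(alphas)


# Hypothetical feature: 11 conditions clearly met, 11 clearly failed.
print(integral_utility([0.005] * 11 + [0.2] * 11))
```

Under this assumed interpolation a feature that meets exactly half of the 22 conditions at high significance and fails the other half lands on the boundary U = 0.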

Only the linearly independent features X_Z,m,w with the highest positive utility are selected:

$$U(X_{Z,m,w}, F) = \max_{Z',\,m',\,w'} U(X_{Z',m',w'}, F) > 0. \qquad (7)$$

For the utility U(X_Z,m,w, F) to be positive, the statistical feature X_Z,m,w has to meet at least half of the 22 conditions of linear regression applicability. Thus, the probability that a feature X with positive utility U(X, F) is selected by chance from 10⁷ features can be estimated from the binomial distribution:

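$$P\{U(X, F) > 0\} \le 10^{7} \sum_{r=11}^{22} \binom{22}{r} \alpha^{r} (1-\alpha)^{22-r}, \qquad (8)$$

where α is the significance level at which a single condition is met by chance.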

Formula (8) shows that each statistical feature X_Z,m,w selected by formula (7) significantly meets the conditions of linear regression applicability for predicting the site activity. The same holds for the conformational and physico-chemical features X_q,a,b. That is why the features can be generated and tested one by one.
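A binomial estimate of this kind can be evaluated numerically. The sketch below assumes that the 22 conditions are independent, that each is met by chance with probability α = 0.05, and that the chance probabilities of the 10⁷ features are simply summed; none of these choices is stated in the paper, and the function name is hypothetical.

```python
from math import comb


def chance_selection_probability(
    n_features: int, n_conditions: int = 22, alpha: float = 0.05
) -> float:
    """Upper bound on the chance that some random feature meets >= half the conditions."""
    k_min = (n_conditions + 1) // 2  # "at least half": 11 of 22
    # Binomial tail: probability that one random feature meets k_min or more
    # conditions when each is met by chance with probability alpha (ASSUMPTION).
    tail = sum(
        comb(n_conditions, k) * alpha ** k * (1.0 - alpha) ** (n_conditions - k)
        for k in range(k_min, n_conditions + 1)
    )
    # Sum over all tested features, capped at 1.
    return min(1.0, n_features * tail)


p = chance_selection_probability(10 ** 7)
print(f"{p:.3g}")
```

Under these assumptions the bound stays on the order of 10⁻² even for 10⁷ tested features, which is the sense in which exhaustive one-by-one testing remains safe.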

ALGORITHM

The simple combinatorial algorithm used is shown schematically in Fig.3. It works as follows: all possible features X(S_n) are calculated by formulae (2) and (3) for all available site sequences S_n with known activities F_n, and all the corresponding utilities U(X, F) are estimated by formulae (4), (5) and (6). If U(X, F) ≤ 0 for all features, the algorithm terminates without selecting any feature. Otherwise, all linearly independent features {X_k}, 1 ≤ k ≤ K, with the highest positive utilities U(X_k, F)>0 are selected. Based on the selected features {X_k}, the linear-additive approximation (formula 1) predicting the site activity is derived and, finally, the C code implementing this prediction is generated and stored [25].
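The selection step can be outlined in a few lines of Python. This is a simplified sketch, not the authors' implementation: the utility is passed in as a function, a pairwise correlation threshold (`max_abs_r`, a hypothetical parameter) stands in for the linear-independence test, and the toy utility |r| - 0.5 stands in for formulae (4), (5) and (6).

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def select_features(
    candidates: Dict[str, Sequence[float]],
    activities: Sequence[float],
    utility: Callable[[Sequence[float], Sequence[float]], float],
    max_abs_r: float = 0.9,
) -> List[str]:
    """Keep positive-utility features, best first, skipping near-duplicates."""
    scored = [(utility(v, activities), name) for name, v in candidates.items()]
    scored = [(u, name) for u, name in scored if u > 0]
    scored.sort(reverse=True)
    selected: List[str] = []
    for _, name in scored:
        # ASSUMPTION: |r| below a threshold approximates linear independence.
        if all(abs(pearson(candidates[name], candidates[s])) < max_abs_r
               for s in selected):
            selected.append(name)
    return selected  # an empty list means no feature had U > 0


# Toy data; the utility below is a stand-in, not formulae (4)-(6).
toy_utility = lambda xs, fs: abs(pearson(xs, fs)) - 0.5
activities = [1.0, 2.1, 2.9, 4.2, 5.0]
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x1b": [2.0, 4.0, 6.0, 8.0, 10.0],  # linearly dependent on x1
    "x2": [1.0, 3.0, 2.0, 5.0, 4.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
}
selected = select_features(candidates, activities, toy_utility)
print(selected)
```

In the toy run only one of the two linearly dependent features survives, and the uncorrelated one is dropped for having a non-positive utility.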

This algorithm has been implemented with the Borland C compiler on the IBM PC platform to develop the distributed and intelligent database ACTIVITY, shown schematically in Fig.4. It contains three databases, a computer system generating programs to predict site activity, and a library of computer programs predicting site activities. ACTIVITY is available on the WWW in “real-time” mode at http://www.bionet.nsc.ru/SRCG/Activity/.

RESULTS AND DISCUSSION

The most important unit of ACTIVITY is the database of DNA and RNA site activities. Currently, it describes more than 70 samples, examples of which are given in Table 2. Among them are promoters and binding sites for different E. coli regulatory proteins, TATA boxes and binding sites for various eukaryotic transcription factors, translation starts, splicing and 3’processing sites, mutation hotspots and many others. The parameters characterizing specific site activities include association and dissociation rates, affinity, lifetime of the complexes, concentrations of the products controlled by these sites, transcription and translation efficiencies, mutation and cutting frequencies, etc. Fig.5 shows the database format, using as an example the E. coli promoter strength in “-log[P_bla]” units [23].

ACTIVITY also has a database of the conformational and physico-chemical properties of DNA. The current version of the database contains about 40 properties, some of which are listed in Table 3. As an example, Fig.6 shows how the B-helical angle “Direction” is represented in the database.

These two databases provide the input for the computer system that generates programs predicting site activity. For the input data on E. coli promoter strength (Fig.5), the ACTIVITY output is shown in Fig.7. This output is stored in the knowledge base of features significant for predicting the site activity (see the scheme in Fig.4). Fig.8 illustrates what this knowledge means. The concentration of the trinucleotide ASM weighted by the function w(i) shown in Fig.2a correlates significantly with the promoter strength (Fig.8a). The function w(i) assigns the highest weights to the region (-1; 11). This means that the trinucleotide ASM near the transcription start makes the highest contribution to the promoter strength. The Direction averaged over the region (-5; 15) also correlates with the promoter strength (Fig.8b). Based on these two features, the linear-additive approximation (1) for predicting the strength is derived (Table 4). Fig.8c compares the experimental and predicted E. coli promoter strengths. The linear correlation coefficient r=0.91 indicates significant agreement between experiment and prediction.

Table 4 also presents some of the dozens of eukaryotic promoter sites analyzed by ACTIVITY to demonstrate the universality of the linear-additive approximation (1) in this field of intense research. For all these sites, significant statistical, physico-chemical, or conformational features have been identified, and linear-additive approximations predicting the site activities have been derived.

Analysis of the mouse αA-crystallin gene promoter (Fig.9a) showed that the best physico-chemical feature of its PE1B region near the TATA box is the probability of contact with the nucleosome core. This feature correlates negatively with transcription activity, which means that the tighter the interaction of the promoter with nucleosomes, the lower the transcription activity. This result is consistent with the experimental data showing that nucleosome displacement from a promoter precedes TBP/TATA binding [46, 47]. The analysis also demonstrated that conformational features such as the major groove dist (Fig.9b) and the Tilt (Fig.9c) are of great importance for the transcription activity. Using these three features, the linear-additive approximation predicting the transcription activity was derived (Table 4) and tested (Fig.9d).

Analysis of the sequences with known DNA bending in the TBP/TATA complex demonstrated that the bending increases with the inclination (Fig.10a). Similar results were obtained by X-ray analysis of TBP/TATA complexes: DNA bending in these complexes was shown to result from the intercalation of four phenylalanine residues of TBP between a pair of adjacent bases on the minor groove side [48]. Inclination describes the rotation angle of a base pair about its short axis; an increase in this angle widens the minor groove [49], thereby facilitating the intercalation of phenylalanines from the minor groove side and DNA bending.

The linear-additive approximation (1) can also be applied to synthetic analogues of sites and to their mutational variants. A study of synthetic analogues of TATA boxes with known TBP affinity revealed two significant features: (i) the weighted concentration of the dinucleotide TV, which contributes to the TBP affinity primarily in the center of the site (Fig.2b); and (ii) the weighted concentration of the dinucleotide WR, which contributes to the affinity chiefly at the site termini (Fig.2c). The linear-additive approximation predicting the TBP/DNA affinity was derived using these two statistical features (Table 4). The agreement between the experimental and predicted affinities is shown in Fig.10b.

Fig.11 demonstrates that, for an arbitrary site, conformational (a, b), physico-chemical (c), and statistical (d) DNA and RNA features can all prove significant for predicting the site activity.

In summary, we would like to emphasize that the linear-additive approximation (formula 1) derived for predicting site activities can be helpful in a wide range of investigations in molecular biology. Importantly, ACTIVITY does not require a large body of initial experimental data. It is completely automated and available on the WWW (http://www.bionet.nsc.ru/SRCG/Activity/).

This work was supported by grants from the Russian Foundation for Basic Research.

REFERENCES

1. P. Bucher, J. Mol. Biol., 212, 563 (1990)

2. I. Ioshikhes and E.N. Trifonov, Nucleic Acids Res. 21, 4857 (1993)

3. J.D. Helmann, Nucleic Acids Res. 23, 2351 (1995)

4. E. Wingender, A.E. Kel, et al., Nucleic Acids Res., 25, 265 (1997)

5. M.S. Gelfand, J. Comput. Biol., 2, 87 (1995)

6. S. Karlin and V. Brendel, Science, 257, 39 (1992)

7. E.C. Uberbacher, Y. Xu, and R.J. Mural, Methods Enzymol., 266, 259 (1996)

8. Q.K. Chen, G.Z. Hertz, and G.D. Stormo, CABIOS, 13, 29 (1997)

9. J.W. Fickett, Trends Genet., 12, 316 (1996)

10. R. Guigo and J.W. Fickett, J. Mol. Biol., 253, 51 (1995).

11. V.V. Solovyev, A.A. Salamov, and C.B. Lawrence, Nucleic Acids Res., 22, 5156 (1994).

12. E.E. Snyder and G.D. Stormo, J. Mol. Biol., 248, 1 (1995).

13. A.V. Mazin and S.C. Kowalczykowski, Proc. Natl. Acad. Sci. U.S.A., 93, 10673 (1996)

14. J.G. Kim, Y. Takeda, B.W. Matthews, and W.F. Anderson, J. Mol. Biol., 196, 149 (1987)

15. C. Coulondre, J.H. Miller, P.J. Farabaugh, and W. Gilbert, Nature, 274, 775 (1978)

16. A. Gil and N.J. Proudfoot, Cell, 49, 399 (1987)

17. A.A. Sokolenko, I.I. Sadomirsky, and L.K. Savinkova, Mol. Biol. (Msk), 30, 279 (1996).

18. M.E. Mulligan, D.K. Hawley, et al., Nucleic Acids Res., 12, 789 (1984)

19. G.D. Stormo, T.D. Schneider, and L. Gold, Nucleic Acids Res., 14, 6661 (1986)

20. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 193, 723 (1987)

21. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 200, 709 (1988)

22. D. Barrick, K. Villanueba, et al., Nucleic Acids Res., 22, 1287 (1994)

23. J. Jonsson, T. Norberg, et al., Nucleic Acids Res., 21, 733 (1993)

24. R.J. Kraus, E.E. Murray, et al., Nucleic Acids Res., 24, 1531 (1996)

25. M.P. Ponomarenko, A.N. Kolchanova, and N.A. Kolchanov, J. Comput. Biol., 4, 83 (1997)

26. M.P. Ponomarenko, L.K. Savinkova, et al., Mol. Biol (Msk), 31, 726 (1997)

27. T. Etzold and P. Argos, CABIOS, 9, 49 (1993)

28. P.C. Fishburn, Utility theory for decision making, New York, John Wiley & Sons (1970)

29. L.A. Zadeh, Information and Control, 8, 338 (1965)

30. E. Förster and B. Rönz, Methoden der Korrelations- und Regressionsanalyse, Berlin, Verlag Die Wirtschaft (1979)

31. M.R. Gartenberg and D.M. Crothers, Nature, 333, 824 (1988)

32. L.W. Chiang and M.M. Howe, Genetics, 135, 619 (1993)

33. D.B. Starr, B.C. Hoopes, and D.K. Hawley, J. Mol. Biol., 250, 434 (1995)

34. D. Boyd et al., J. Mol. Biol., 253, 677 (1995)

35. A.J. Bendall and P.L. Molloy, Nucleic Acids Res., 22, 2801 (1994)

36. C.M. Sax, A. Cvekl, et al., Nucleic Acids Res., 23, 442 (1995)

37. A. Kretsovali, and J. Papamatheakis, Nucleic Acids Res., 23, 2919 (1995)

38. M. McDevitt et al., EMBO J., 5, 2907 (1986)

39. C.F. Lesser and C. Guthrie, Genetics, 131, 851 (1993)

40. H. Karas, R. Knuppel, W. Schulz, H. Sklenar, and E. Wingender, CABIOS, 12, 441 (1996)

41. A.A. Gorin, V.B. Zhurkin, and W.K. Olson, J. Mol. Biol., 247, 34 (1995)

42. E.S. Shpigelman, E.N. Trifonov, and A. Bolshoy, CABIOS, 9, 435 (1993)

43. M. Suzuki, N. Yagi, and J.T. Finch, FEBS Lett., 397, 148 (1996)

44. M.E. Hogan and R.H. Austin, Nature, 329, 263 (1987)

45. N. Sugimoto, S. Nakano, M. Yoneyama, and K. Honda, Nucleic Acids Res., 24, 4501 (1996)

46. D.G. Edmondson and S.Y. Roth, FASEB J., 10, 1173 (1996)

47. J.S. Godde, Y. Nakatani, and A.P. Wolffe, Nucleic Acids Res., 23, 4557 (1995)

48. Z.S. Juo, T.K. Chiu, et al., J. Mol. Biol. 261, 239 (1996).

49. EMBO Workshop, EMBO J., 8, 1 (1989)

Table 1. The sites of the same type can differ in activity by several orders of magnitude

Site name	Site sequence	Activity
DNA/RecA-filament affinity	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	350
in E. coli [13]	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	1
Cro/DNA-affinity in	TAATGTGAGTTAGCTCACTCAT	91
E. coli [14]	TAATGTAAGTTAGCTCACTCAT	1
2-aminopurine induced	CGCGTGGTGAACCAGGCCAGCCACG	51
mutations C->T [15]	ACCACCATCAAACAGGATTTTCGCC	1
3’processing pre-mRNA	UUCUACCGGAUCGUUGUGUUCGAGG	13
in SV40 [16]	UUCUACCGGAUCGUUUUGGUCGAGG	1
TBP/TATA-affinity in	GGGGCTATAAAAGGGGGTGG	7
yeast [17]	GTACCTATGGGTCTGCTGGT	1

Table 2. Examples of the sites with known activities available in the database ACTIVITY

Site		Activity				Ref.
Name	Sequences	Parameter	Scale	Min	Max
Cro-binding site	Natural	Association rate constant	ln	19.1	19.9	[14]
CRP-binding site	Natural	Affinity CRP/DNA	ln	-3.2	3.2	[31]
E. coli promoter	Mutant	Promoter strength	-log	0.26	2.1	[23]
C-protein binding site	Mutant	Transcription activity	ln	-6.2	1.8	[32]
TATA box	Mutant	TBP/DNA lifetime	minute	1	185	[33]
TATA box	Mutant	Bend, DNA/TBP complex	degree	33	106	[33]
Transcription signal INR	Mutant	Affinity INR/DNA	ln	-4.6	1.3	[24]
Transcription signal OCT-1	Mutant	Transcription activity	ln	-2.3	0.63	[34]
Transcription signal USF	Synthetic	Affinity USF/DNA	ln	3.8	100	[35]
PE1B box (adjacent TATA box)	Mutant	Transcription activity	ln	-1.4	1.4	[36]
Transcription signal IL-1	Mutant	Transcription activity	ln	-1.9	4.1	[37]
Pre-mRNA 3’cleavage site	Mutant	Cleavage efficiency	%	3	289	[38]
Pre-mRNA donor splice site	Mutant	Cleavage efficiency	%	18	100	[39]
E. coli ribosome-binding site	Synthetic	Translation activity	ln	0.0	8.06	[22]
2-aminopurine induced mutation	Natural	Mutation frequency	ln	0.0	5.6	[15]

Table 3. Examples of the DNA properties available in the database ACTIVITY

Property name	Unit	Min	Max	Ref.
Conformational:
Twist	Degree	31.1	41.4	[40]
Propeller	Degree	-17.3	-6.7	[41]
Tip	Degree	-1.64	6.7	[40]
Inclination	Degree	-1.43	1.43	[40]
Tilt	Degree	-2.6	0.6	[41]
Bend	Degree	2.16	6.74	[40]
Wedge	Degree	1.1	8.4	[42]
Direction	Degree	-154	180	[42]
Roll	Degree	-6.2	6.2	[43]
Rise	Angstrom	3.16	4.08	[40]
Slide	Angstrom	-0.4	1.6	[43]
Minor groove width (width)	Angstrom	4.62	6.40	[40]
Minor groove depth (depth)	Angstrom	8.79	9.11	[40]
Minor groove width size (size)	Angstrom	2.7	4.7	[41]
Minor groove width distance (dist)	Angstrom	2.79	4.24	[41]
Major groove width (WIDTH)	Angstrom	12.1	15.5	[40]
Major groove depth (DEPTH)	Angstrom	8.45	9.60	[40]
Major groove size (SIZE)	Angstrom	3.26	4.70	[41]
Major groove distance (DIST)	Angstrom	3.02	3.81	[41]
Physico-chemical:
Clash strength	f	0.00	2.53	[41]
Bending mobility to minor groove	m	1.02	1.27	[31]
Bending mobility to major groove	m	0.99	1.18	[31]
Persistence length	nl	20	130	[44]
Melting temperature	°C	36.7	136.1	[44]
Probability to be contacting nucleosome core	%	1	18	[44]
Enthalpy change	kcal/mol	-11.8	-5.6	[45]
Entropy change	cal/mol/K	-28.4	-15.2	[45]
Free energy change	kcal/mol	-2.8	-0.9	[45]

Table 4. Examples of the functional DNA and RNA sites analyzed by the system ACTIVITY

	Site			Feature selected			Significance
Name	Position “1”	n	Activity, F	X_k	Region	Property	U	r	α
E. coli promoters	Transcription	27	Strength	X₁	Fig.2a	[ASM]	0.59	0.86	10^-2
[23]	start			X₂	-5; 15	Direction	0.50	0.71	10^-2
				F=0.3+0.6×X₁+0.0008×X₂				0.91	10^-4
PE1B region adjacent to	Transcription	10	Transcription	X₁	-32; -25	Pnucl	0.36	-0.77	10^-2
the TATA box of	start		activity	X₂	-29; -19	DIST	0.41	0.86	10^-3
the αA-crystallin				X₃	-31; -25	Tilt	0.38	-0.78	10^-2
promoter [36]				F=-39-0.1×X₁+12×X₂-X₃				0.90	10^-4
TATA boxes	Synthetic	19	Affinity to	X₁	Fig.2b	[TV]	0.35	0.73	10^-2
(synthetic)	DNA		yeast TBP	X₂	Fig.2c	[WR]	0.41	0.76	10^-2
[17]	start			F=14.5+2.5×X₁+0.9×X₂				0.77	10^-2
TATA boxes	TATA box	9	DNA bending	X₁	0; 9	Inclination	0.19	0.76	0.05
(mutant) [33]	start		TBP/TATA	F=120.15+70.32×X₁				0.76	0.05
USF binding site	Synthetic	14	Affinity	X₁	11; 15	depth	0.22	-0.78	10^-3
(synthetic) [35]	DNA start		USF/DNA	X₂	11; 20	Twist	0.23	-0.86	10^-4
				F=170-16.3×X₁-0.7×X₂				0.91	10^-5
CRP-binding site [31]	Center of the	10	Affinity	X₁	-15; 14	Rise	0.15	-0.86	10^-2
	consensus		CRP/DNA	X₂	-17; 12	width	0.06	0.78	10^-2
	repeat			F=190-66.8×X₁+7.5×X₂				0.87	10^-2
2-aminopurine induced	Mutation	26	Mutation	X₁	-1; 2	Tmelt	0.20	0.90	10^-5
mutations C->T [15]	point		frequency	F=-8.5568+0.1585×X₁				0.90	10^-5
ssDNA/RecA-filament	Synthetic	15	DNA/RecA	X₁	Fig.2d	[DRV]	0.27	-0.89	10^-5
(synthetic) [13]	DNA start		affinity	F=0.54-1.03×X₁				0.89	10^-5
the SV40 pre-mRNA	RNA cutting	16	Cutting	X₁	Fig.2a	[VUKK]	0.24	0.76	10^-4
3’processing site [16]	point		frequency	F=-301.72+216.16×X₁				0.76	10^-4
Cro-binding site [14]	Consensus	7	Affinity	X₁	1; 16	width	0.55	0.97	10^-3
	start		Cro/DNA	X₂	6; 19	Roll	0.44	0.90	10^-3
				X₃	6; 19	Rise	0.41	0.92	10^-2
				F=-72+4×X₁+X₂+13×X₃				0.99	10^-5

Notes: n, the total number of site variants; X_k, the selected context-dependent feature; U, utility value; r, linear correlation coefficient; α, significance of the linear correlation coefficient value; [Z], the concentration of the oligonucleotide Z weighted with the weight function w(i) given in Fig.2 (formula 2); Pnucl, probability of contact with the nucleosome core; Tmelt, melting temperature; depth, minor groove depth; width, minor groove width; WIDTH, major groove width; DIST, major groove dist; F=F₀+Σ_{k=1,K} F_k×X_k, linear-additive approximation (formula 1) predicting the site activity.

FIGURE LEGENDS

Fig. 1. The tetranucleotide VUKK concentration is responsible for the SV40 pre-mRNA 3’processing efficiency (a); and the major groove width is responsible for the Cro/DNA affinity (b).

Fig. 2. Examples of the weight functions w(i) modeling the highest effect of oligonucleotides located within the site 3’-half (a), central part (b), termini (c) and near 5’-terminus (d) on site activity.

Fig. 3. Algorithm for generating the C-code programs to predict site activities (where the indexes “φ, γ, λ” are either the indexes “Z, m, w” of formula (2) or the indexes “q, a, b” of formula (3)).

Fig. 4. A scheme of the distributed and intelligent database ACTIVITY.

Fig. 5. The description of experimental data on the E.coli promoter strength [23] within ACTIVITY: MI, entity identifier; MN, sample name; OG, genome region; OS, species; FF, site; AN, activity name; AU, activity unit; SC, variant; SQ, sequence; SA, activity value.

Fig. 6. The description of the conformational property “Direction” [42] within ACTIVITY: MI, entity identifier; MN, property type; MD, molecule; ML, step; PN, property name; PM, identifying method; PU, property unit; DINUCLEOTIDE, property values.

Fig. 7. The ACTIVITY result for the E. coli promoter strength. Fields: MI, entity identifier; MN, sample name; CF, feature type; PV, property/oligonucleotide; AB, region; UT, utility; LC, linear correlation coefficient; C-CODE, the computer program calculating the feature.

Fig. 8. An interpretation of the ACTIVITY result for the E. coli promoter strength: (a) the trinucleotide ASM concentration correlates with the promoter strength; (b) the Direction correlates with the strength (r=0.71); (c) the agreement between the experiment and prediction.

Fig. 9. The ACTIVITY result for the transcription activity of the mouse αA-crystallin gene promoter containing the PE1B region near the TATA box [36]: (a) the probability of contact with the nucleosome core correlates negatively with the transcription activity; (b) the major groove dist correlates positively with the transcription activity; (c) the tilt correlates negatively with the transcription activity; (d) the agreement between the experimental and predicted data (Table 4).

Fig. 10. The ACTIVITY result of the TATA boxes: (a) the agreement between the experimental [17] and predicted TBP/DNA affinity; (b) the DNA bend within the TBP/TATA complex [33] correlates with the inclination.

Fig. 11. Examples of ACTIVITY-results: (a) the USF/DNA affinity [35] correlates with the twist; (b) the CRP/DNA affinity [31] correlates with the rise; (c) the frequency of the mutation induced by 2-aminopurine [15] correlates with the DNA melting temperature; (d) the ssDNA/RecA-filament affinity [13] correlates with the trinucleotide DRV concentration weighted by the weight function w(i) given in Fig.2d.
