SAMS, 1995, v. 18-19, pp. 819-822.

Computer analysis of the structure of transcription factor binding sites



Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, 630090 Novosibirsk, Russia



A method for revealing correlated nucleotide positions in CRE, based on the entropy correlation coefficient, is described. Runs of related positions are found. The connection of these correlations with DNA helical structure and their possible application to site recognition are discussed.



Recognition of potential binding sites for transcription factors is one of the key tasks in the investigation of transcription regulation. Among the approaches to functional site analysis and recognition, the following are in use: consensus, matrix of nucleotide frequencies [1], discrimination energy [2], and consensus classification [3]. As a rule, these methods are applied to rather narrow regions that include the core of the functional site under investigation. However, many contextual features of the long flanking sequences around functional sites are important for functional site recognition [4].

We introduce a method for revealing correlated nucleotide positions in long DNA sequences flanking functional sites. The method was applied to the analysis of CRE flanking regions in vertebrate genes. The sample of CRE sequences was compiled using the TRANSFAC database [5]. The 31 sequences are uniform in length, L = 98 bp, and the core sequence TGACgt is located at position 46. All the sequences are aligned for optimal matching to the TGAC core sequence.



The entropy correlation coefficient [6] is applied to analyze the correlation between nucleotide positions. For each position i we can calculate the entropy H(i), which estimates the nucleotide variability at the i-th position in a sample of N sequences:

$$H(i) = -\sum_{s \in \{A,C,G,T\}} f_s(i) \log f_s(i), \qquad (1)$$

where $f_s(i)$ is the frequency of nucleotide $s$ at the i-th position: $f_s(i) = n_s(i)/N$, where $n_s(i)$ is the number of nucleotides $s$ at the i-th position throughout all the sequences. For any two positions i and j we can calculate the joint entropy:

$$H(i,j) = -\sum_{s,r} f_{sr}(i,j) \log f_{sr}(i,j), \qquad (2)$$

where $f_{sr}(i,j)$ is the frequency of simultaneous presence of nucleotides $s$ and $r$ at positions i and j, respectively: $f_{sr}(i,j) = n_{sr}(i,j)/N$, where $n_{sr}(i,j)$ is the number of sequences that contain nucleotides $s$ and $r$ at positions i and j, respectively. The entropy correlation coefficient is

, (3)

where I(i,j) = H(i) + H(j) - H(i,j) is the information measure of dependence between positions i and j.
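The entropies and the information measure I(i,j) can be computed directly from an alignment. The following is a minimal sketch, assuming natural logarithms and equal-length, gap-free sequences given as Python strings; the function names are illustrative and not taken from the original work. It stops at I(i,j) and does not attempt to reproduce the normalization of formula (3).

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (natural log) of the empirical distribution of items."""
    items = list(items)
    n = len(items)
    return -sum((c / n) * math.log(c / n) for c in Counter(items).values())

def position_entropy(seqs, i):
    """H(i): entropy of the nucleotides in alignment column i (0-based)."""
    return entropy(s[i] for s in seqs)

def joint_entropy(seqs, i, j):
    """H(i,j): entropy of the nucleotide pairs observed in columns i and j."""
    return entropy((s[i], s[j]) for s in seqs)

def information(seqs, i, j):
    """I(i,j) = H(i) + H(j) - H(i,j)."""
    return (position_entropy(seqs, i) + position_entropy(seqs, j)
            - joint_entropy(seqs, i, j))

# Toy alignment (hypothetical data): columns 0 and 1 are fully dependent,
# columns 0 and 2 are independent.
toy = ["ACG", "ACT", "GTG", "GTT"]
```

On the toy alignment, `information(toy, 0, 1)` equals log 2 (the columns determine each other), while `information(toy, 0, 2)` is zero.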

The values of the entropy coefficient C(i,j) lie in the range [0,1]. If C(i,j) = 0, positions i and j are independent; if C(i,j) = 1, positions i and j are strongly correlated. To estimate the upper cut-off value for the entropy correlation coefficient, we ran computer simulations on random sequences. The upper cut-off level is C*(i,j) = 0.27 for N = 31 at the 0.05 significance level.
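A simulation of this kind can be sketched as follows. The sketch assumes independent, uniformly distributed nucleotides and, by default, uses the unnormalized information I(i,j) as the statistic; both choices are assumptions, since the details of the original simulation are not given. Any other statistic, such as C(i,j), can be passed through the hypothetical `stat` parameter, and the returned value is not expected to reproduce the reported 0.27.

```python
import math
import random
from collections import Counter

def entropy(items):
    """Shannon entropy (natural log) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log(c / n) for c in Counter(items).values())

def information(col_i, col_j):
    """I(i,j) = H(i) + H(j) - H(i,j) for two alignment columns."""
    return entropy(col_i) + entropy(col_j) - entropy(list(zip(col_i, col_j)))

def simulated_cutoff(stat=information, n_seqs=31, trials=2000,
                     level=0.05, seed=0):
    """Upper cut-off: the (1 - level) quantile of stat over random column pairs.

    ASSUMPTION: columns are independent with uniform A/C/G/T frequencies.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        col_i = [rng.choice("ACGT") for _ in range(n_seqs)]
        col_j = [rng.choice("ACGT") for _ in range(n_seqs)]
        values.append(stat(col_i, col_j))
    values.sort()
    return values[min(trials - 1, int((1 - level) * trials))]
```

Observed values above the returned threshold would be treated as significant at the chosen level under these assumptions.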


The CRE flanking regions were analyzed using the entropy correlation coefficient. Examples of the most significantly correlated i-j pairs are given in Table 1, together with the most frequent pairs of nucleotides for each i-j pair. One of the most prominent correlated pairs is 5-67. Two pairs of nucleotides, c g and g a, are very frequent at these two positions. Such nucleotide correlations can improve recognition of the CRE. For example, 68% of CRE sequences, in contrast to only 25% of random sequences, contain the pairs c s or g w at positions 5 and 67. Using all the correlated positions is expected to further increase the recognition ability of the methods applied.
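The 25% figure for random sequences can be checked by simple counting. In IUPAC notation s stands for c or g and w for a or t, so with independent, equiprobable nucleotides the pairs c s and g w together occur with probability 2 x (1/4 x 1/2) = 0.25. The sketch below counts such matches; it assumes that the first letter of each pattern refers to position 5 and the second to position 67, and the function names are illustrative.

```python
from itertools import product

# IUPAC codes needed here: S = C or G, W = A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "W": "AT"}

def matches(seq, pos_i, pos_j, patterns):
    """True if the nucleotides at 1-based positions pos_i, pos_j fit any pattern."""
    a, b = seq[pos_i - 1].upper(), seq[pos_j - 1].upper()
    return any(a in IUPAC[p[0]] and b in IUPAC[p[1]] for p in patterns)

def fraction_matching(seqs, pos_i, pos_j, patterns):
    """Fraction of sequences whose (pos_i, pos_j) nucleotides fit a pattern."""
    return sum(matches(s, pos_i, pos_j, patterns) for s in seqs) / len(seqs)

# All 16 equiprobable dinucleotides, standing in for random sequences.
all_pairs = ["".join(p) for p in product("ACGT", repeat=2)]
expected_random = fraction_matching(all_pairs, 1, 2, ["CS", "GW"])
```

For a real sample, `fraction_matching(seqs, 5, 67, ["CS", "GW"])` would give the quantity quoted as 68% for the CRE sequences.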

TABLE 1. Examples of the most significantly related i-j pairs

Related pairs i-j        The most frequent pairs of nucleotides

We found a significantly large number of runs of correlated positions of length 3 bp in the CRE sequences. We call a run two sets of positions in which each consecutive position of one set correlates with the corresponding consecutive position of the second set. Based on queue theory [7], we can write down the mean number of runs of length l for a random distribution of correlated positions.

, (4)

where the two quantities entering (4) are the number of all possible pairs of positions and the probability of a correlation between two fixed positions i and j, and m is the number of correlated pairs of positions found. According to (4), the mean number of runs of length l = 3 expected by chance is 1.2. We found K = 7 runs of this length. The probability P(K>6) of obtaining so many runs by chance is rather small.

, (5)
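As a rough plausibility check of this statement, one can assume that the number of runs found by chance is approximately Poisson distributed with mean 1.2. This is an assumption made here for illustration and is not necessarily the expression used in (5). Under it, the tail probability P(K>6) can be computed as follows.

```python
import math

def poisson_tail(mean, k_min):
    """P(K >= k_min) for K ~ Poisson(mean), via the complement of the CDF."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(k_min))
    return 1.0 - below

# Mean number of chance runs of length 3 (from the text) and the observed K = 7.
p_value = poisson_tail(1.2, 7)
```

Under the Poisson assumption the result is well below 0.001, in line with the statement that such a number of runs is unlikely to arise by chance.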

An example of a run of length 5 bp, from positions 11 to 15 related to positions from 19 to 24, is partially presented in Table 1. It is interesting to note that the distances between the correlated positions in this run are roughly equal to one turn of the DNA helix. Thus, nucleotides in the correlated positions are situated on the same side of the DNA and can interact simultaneously with the transcription complex.
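Given a list of significantly correlated pairs, runs of this kind can be found by chaining pairs (i, j), (i+1, j+1), and so on. The sketch below is one possible reading of the definition of a run: it reports maximal runs only and assumes that both sets of positions advance in the same direction. Whether the original count used maximal runs or every window of length l is not stated, so treat this as an illustration.

```python
def find_runs(pairs, min_length=3):
    """Return maximal runs as (start_i, start_j, length) tuples.

    A run of length l is a chain (i, j), (i+1, j+1), ..., (i+l-1, j+l-1)
    in which every pair belongs to the set of correlated pairs.
    """
    pair_set = {(min(i, j), max(i, j)) for i, j in pairs}
    runs = []
    for i, j in sorted(pair_set):
        if (i - 1, j - 1) in pair_set:
            continue  # this pair continues an earlier run, not a start
        length = 1
        while (i + length, j + length) in pair_set:
            length += 1
        if length >= min_length:
            runs.append((i, j, length))
    return runs

# Hypothetical correlated pairs, for illustration only.
example_pairs = [(1, 10), (2, 11), (3, 12), (5, 20), (6, 21), (30, 40)]
```

With these hypothetical pairs, `find_runs(example_pairs)` returns the single run starting at (1, 10) with length 3; lowering `min_length` to 2 also reports the run starting at (5, 20).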

We computed the distribution of all distances between related positions for the CRE sequences (Fig. 1). Many of the related positions lie very close to each other, 1 or 2 bp apart. One of the most interesting features of this distribution is its periodicity, with a cycle of about 10 bp. The most pronounced peaks of the distribution, which correspond to the largest numbers of related positions, fall at distances of about 10, 30, 50 and 60 bp (shown by arrows).
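A distribution of this kind is a histogram of |i - j| over the correlated pairs. A minimal sketch, using hypothetical pairs rather than the actual data behind Fig. 1:

```python
from collections import Counter

def distance_histogram(pairs):
    """Count how many correlated pairs (i, j) lie at each distance |i - j|."""
    return Counter(abs(i - j) for i, j in pairs)

# Hypothetical correlated pairs, for illustration only.
hist = distance_histogram([(5, 67), (11, 21), (12, 22), (3, 4)])
```

Here `hist[10]` is 2, because two of the example pairs are 10 bp apart; peaks recurring at multiples of about 10 bp in such a histogram would correspond to the periodicity described above.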


FIGURE 1. Distribution of distances between related positions



1. M.B.Shapiro and P.Senapathy, Nucleic Acids Res., 15, 7155-7174 (1987)

2. O.G.Berg and P.H.von Hippel, J. Mol. Biol., 193, 723-750 (1987)

3. M.Kudo, S.Kitamura-Abe, M.Shimbo and Y.Iida, Comput. Applic. Biosci., 8, 367-376 (1992)

4. A.E.Kel, M.P.Ponomarenko, E.A.Likhachev, Yu.L.Orlov, I.V.Ischenko, L.Milanesi and N.A.Kolchanov, Comput. Applic. Biosci., 9, 617-627 (1993)

5. R.Knueppel, P.Dietze, W.Lehnberg, K.Frech and E.Wingender, J. Comput. Biol., 1, 191-198 (1994)

6. S.A.Aivazyan, V.M.Buchstaber, I.S.Yenyukov and L.D.Meshalkin, Applied Statistics. Classification and Reduction of Dimensionality (Finansy i statistika, Moscow, 1989)

7. W.Feller, An Introduction to Probability Theory and Its Applications, Volume I (1967)