The search for DNA regions with low complexity is one of pivotal tasks of modern structural analysis of complete genomes. The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats, by palindrome-hairpin structures, as well as by combination of all these factors. Numerical measures of text complexity including combinatorial and linguistic ones, together with complexity estimation by modified Lempel-Ziv algorithm are reviewed. Methods were implemented in a software tools LZcomposer and LowComplexity.
Intuitively, complexity of symbolic sequence reflects an ability to represent a sequence in a compact form based on some structural features of this sequence. To evaluate text complexity, several groups of methods were developed: entropy measures (Shannon, 1948), with the simplest of them using only the alphabet symbol frequencies (Wootton&Federhen, 1996); method of clusterization of cryptically simple sequences (Hancock&Armstrong, 1994; Alba et al., 2002); evaluation of the alphabet capacity l-gram (combinatorial complexity and linguistic complexity) (Milosavljevic&Jurka,1993; Kisliuk et al., 1999; Gabrielian&Bolshoy, 1999; Troyanskaya et al., 2002); modifications of complexity measure by Lempel and Ziv (Gusev et al., 1991; 1999; Chen et al., 1999); stochastic complexity (Orlov et al., 2002); and grammatical complexity (Jimenez-Montano et al., 2002).
The general approach to estimating complexity of symbolic sequences (texts) was suggested by A.N. Kolmogorov (Kolmogorov, 1965). He proved that there exists an optimal algorithm or a program for the text generation. Kolmogorov complexity is the length of the shortest code generating given sequence. Kolmogorov complexity is not recursive function (is not realized in computational scheme). However, for the sequence of finite length, various constructive realizations of non-optimal coding were developed (Lempel and Ziv, 1976; Ebeling and Jimenes-Montano, 1980; Gusev et al., 1991; 1999; Babenko et al., 1999; Chen et al. 1999). Approaches to genetic sequence analysis based on compression algorithms, has been suggested also by Grumbach and Tahi, Milosavljevic and Rivals (Rivals et al., 1996). At whole information content of genetic sequences was intensively studied (Allison et el., 2000).
There have been a considerable number of theoretical studies on information measures and complexity for genetic macromolecules (Ebeling and Jimenez-Montano, 1980). Compression algorithms applied to DNA sequences were discussed by Grumbach and Tahi (Grumbach and Tahi, 1994). Entropies of DNA sequences have been discussed in (Mantegna et al., 1994; Schmitt and Herzel, 1997). The mutual information, a measure intimately related to entropies, has been successfully used to predict protein coding regions in DNA (Grosse et al., 2000). A more systematic approach is based on the reconstruction of the word frequencies using frequencies of other, shorter sub-words (Gorban et al., 2000). A method for sequences comparison based on the specific entropy of a frequency dictionary was suggested by M.Sadovsky (Sadovsky, 2003). Calculations made on several genomes detected no significant differences of reconstructed frequencies from real ones. The complexity of large sets of non-redundant protein sequences was measured in (Weiss et al., 2000); there it was shown that proteins are fairly close to random sequences.
As the method for complexity evaluation, we have chosen the scheme of the text representation in terms of repeats, which uses the concept of complexity of a finite symbolic sequence, introduced by Lempel and Ziv (Lempel and Ziv, 1976). The approach of Lempel and Ziv is oriented to development of efficient algorithm for data compression. While studying complexity, we are interested not in a mere compression of genetic texts, but rather in detection of the regularities underlying it. The Lempel–Ziv complexity measure is based on text segmentation; we termed it as complexity decomposition. It may be interpreted as representation of a text in terms of repeats. Initially, this approach was implemented for analyzing DNA by Gusev and co-authors (Gusev et al., 1989; 1991; 1993; 1999). Local sequence complexity is connected with molecular mechanisms of mutagenesis (Chuzhanova et al., 2003).
An computer realization of this measure was suggested by (Babenko et al., 1999). The program Complexity profile_builder is available at http://wwwmgs/mgs/programs/gc_net/.
We present LZcomposer as new implementation with additional capabilities connected with long sequences (complete genomes), flexible profile construction, using different alphabets, analysis of several sequences and relative complexity estimations for group of sequences.
In LowComplexity software suggested, we have realized some additional known estimates of complexity in order to compare different approaches. A user may choose the method of his interest or construct several complexity profiles simultaneously. In particular, we have realized the following evaluations of the text complexity: (1) by nucleotide frequency content (Wootton&Federhen, 1996); (2) by entropy of the given order of words (oligonucelotides); (3) by linguistic complexity (Troyanskaya et al., 2002). In the latter approach, linguistic complexity refers to combinatorial complexity denoted as the power of the l-gram dictionary under the fixed l. By summing up the values of combinatorial complexity over all values l, 1£l£N, where N is the sequence length, and dividing by maximal dictionary size we obtain the value of linguistic complexity. This measure was applied for studying the patterns composing nucleosomes and promoters (Gabrielian and Bolshoy, 1999; Troyanskaya et al., 2002).
Entropy-based evaluations of the text complexity (entropy of the l-th order) were introduced by Shannon (Shannon, 1948). Evaluations based on the variable memory Markov model may serve as a generalization (Orlov et al., 2002). For analysis of extended sequence complexity, a program is available that estimates stochastic complexity as the measure of data compression in the variable memory Markov model. An optimal variable memory Markov model is determined by a program in a form of the context source-tree as an optimal model for the text compression. Stochastic complexity is determined as sum of the size of compressed text and the complexity of model parameters. The method could be applied for analysis and comparison of extended sequence complexity (for example, complete bacterial genomes) and described in (Orlov et al., 2002).
By application of l-gram trees for the sequence representation in our software, we have resolved the computational problems stemming from a considerable length of the sequences processed. Our package is designed for analysis of an arbitrary symbolic sequence, including DNA and amino acid sequences.
The software is designed for effective search for the regions with low complexity in extended DNA sequences. The search is provided by different methods and its operation time is linearly dependent upon the sequence length. The program software enables to calculate average complexity profile for the sets of sequences in the FASTA format. By using the software, we have demonstrated the higher nucleotide sequence complexity of exons and lower complexity of introns (Orlov et al., 2004, in press). Also, we have found the alteration of the local text complexity for the splicing sites.
Recent findings by the program in promoter complexity, complete bacterial genomes complexity comparison
References
Ebeling,W. and Jimenez-Montano,M. A. (1980). On grammars, complexity, and information measures of biological macromolecules. Math. Biosci. 52, 53-71.
Gorban,A.N., Popova,T.G. and Sadovsky,M.G. (2000). Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. Open Sys.& Information Dyn. 7, 1-17.
Grosse,I., Herzel,H., Buldyrev,S.V. and Stanley,H.E. (2000). Species independence of mutual information in coding and noncoding DNA. Phys. Rev. E 61, 5624-5629.Grumbach,S. and Tahi,F. (1994) A new challenge for compression algorithms: genetic sequences. J. Inf. Process. Manage 30, 875-886.
Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1989) Complexity analysis of genetic texts (with phage lambda as an example). Preprint no.20. Institute of Mathematics Acad. Sci. USSR, Siberian branch, Novosibirsk. (in Russian).
Kolmogorov,A.N. (1965) Three approaches to definition of information quantity. Probl. Peredachi Inf. 1, 3–11. (in Russian).
Lempel,A. and Ziv,J. (1976) On the complexity of finite sequences. IEEE Trans. Inf. Theory, IT-22, 75–81.
Orlov,Yu.L., Filippov,V.P., Potapov,V.N. and Kolchanov,N.A. (2002) Construction of stochastic context trees for genetic texts. In Silico Biology, 2, 233-247, <http:www.bioinfo.de/isb/2002/02/0022/>.
Orlov Yu.L., Gusev V.D. and Miroshnichenko L.A. (2004) LZcomposer: decomposition of genomic sequences by repeat fragments. Biophysics (Mosk.), (in Press)
Shannon, C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, pt.I., 379-423; pt.II., 623-656.
To search low complexity regions use Low Complexity program
Related complexity analysis algorithms:
Complexity by context tree source by Orlov&Potapov, Novosibirsk, Russia (Orlov et al., 2002)
SIMPLE by J.Hancock, Royal Holloway University of London, M.Alba, Universitat Pompeu Fabra, Spain (Alba et al., 2002)
Linguistic Complexity by G.Landau, A.Bolshoy, University of Haifa, Israel (Troyanskaya et al., 2002)
Transformation Distance by J.Varre, Universite de Lille, France (Varre et al., 1999)
GenCompress, M.Li, S.Kwong, X.Chen, City University of HongKong, China (Chen et al., 1999)
Old realization of DNA-oriented Lempel-Ziv algorithm by (Babenko et al., 1999): Complexity profile_builder
Graphical presentation of sequence regularities: OligoRep system, Verbumculus, DeBruijn graphs
This resource has been developed in Institute of Cytology
and Genetics, based on methods developed in Sobolev Institute of Mathematics,
Novosibirsk, Russia
Authors: Yu.L.Orlov,
V.N.Potapov, V.D.Gusev,
L.A.Miroshnichenko(Nemytikova)
The research was partially supported by RFBR,
Russian Ministry of
Education (E 02-6.0-250), and Siberian Branch of
the Russian Academy of Sciences (Integration projects 65 and 119).