Review of the methods of genetic text complexity analysis

 

The search for DNA regions with low complexity is one of pivotal tasks of modern structural analysis of complete genomes. The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats, by palindrome-hairpin structures, as well as by combination of all these factors. Numerical measures of text complexity including combinatorial and linguistic ones, together with complexity estimation by modified Lempel-Ziv algorithm are reviewed. Methods were implemented in a software tools LZcomposer and LowComplexity.  

Analysis of genomic sequences issues the challenge to search for the regions with the low text complexity, which could be functionally important (Hancock, 2002; Wan&Wootton, 2000; Chuzhanova et al., 2003; Stern et al., 2001). Low complexity regions are often treated as the regions of biased composition containing simple sequence repeats (Hancock, 2002; Tautz et al., 1986). The sequence enriched with imperfect direct and inverted repeats may be also considered as the sequence with low complexity (Cox&Mirkin, 1997). Search for simple regions in amino acid sequences is important for protein structure analysis (Alba et al., 2002; Nandi et al., 2003; Ramachandran, 2003).

Intuitively, complexity of symbolic sequence reflects an ability to represent a sequence in a compact form based on some structural features of this sequence. To evaluate text complexity, several groups of methods were developed: entropy measures (Shannon, 1948), with the simplest of them using only the alphabet symbol frequencies (Wootton&Federhen, 1996); method of clusterization of cryptically simple sequences (Hancock&Armstrong, 1994; Alba et al., 2002); evaluation of the alphabet capacity l-gram (combinatorial complexity and linguistic complexity) (Milosavljevic&Jurka,1993; Kisliuk et al., 1999; Gabrielian&Bolshoy, 1999; Troyanskaya et al., 2002); modifications of complexity measure by Lempel and Ziv (Gusev et al., 1991; 1999; Chen et al., 1999); stochastic complexity (Orlov et al., 2002); and grammatical complexity (Jimenez-Montano et al., 2002).

The general approach to estimating complexity of symbolic sequences (texts) was suggested by A.N. Kolmogorov (Kolmogorov, 1965). He proved that there exists an optimal algorithm or a program for the text generation. Kolmogorov complexity is the length of the shortest code generating given sequence. Kolmogorov complexity is not recursive function (is not realized in computational scheme). However, for the sequence of finite length, various constructive realizations of non-optimal coding were developed (Lempel and Ziv, 1976; Ebeling and Jimenes-Montano, 1980; Gusev et al., 1991; 1999; Babenko et al., 1999; Chen et al. 1999). Approaches to genetic sequence analysis based on compression algorithms, has been suggested also by Grumbach and Tahi, Milosavljevic and Rivals (Rivals et al., 1996). At whole information content of genetic sequences was intensively studied (Allison et el., 2000).

There have been a considerable number of theoretical studies on information measures and complexity for genetic macromolecules (Ebeling and Jimenez-Montano, 1980). Compression algorithms applied to DNA sequences were discussed by Grumbach and Tahi (Grumbach and Tahi, 1994). Entropies of DNA sequences have been discussed in (Mantegna et al., 1994; Schmitt and Herzel, 1997). The mutual information, a measure intimately related to entropies, has been successfully used to predict protein coding regions in DNA (Grosse et al., 2000). A more systematic approach is based on the reconstruction of the word frequencies using frequencies of other, shorter sub-words (Gorban et al., 2000). A method for sequences comparison based on the specific entropy of a frequency dictionary was suggested by M.Sadovsky (Sadovsky, 2003). Calculations made on several genomes detected no significant differences of reconstructed frequencies from real ones.  The complexity of large sets of non-redundant protein sequences was measured in (Weiss et al., 2000); there it was shown that proteins are fairly close to random sequences.

As the method for complexity evaluation, we have chosen the scheme of the text representation in terms of repeats, which uses the concept of complexity of a finite symbolic sequence, introduced by Lempel and Ziv (Lempel and Ziv, 1976). The approach of Lempel and Ziv is oriented to development of efficient algorithm for data compression. While studying complexity, we are interested not in a mere compression of genetic texts, but rather in detection of the regularities underlying it. The Lempel–Ziv complexity measure is based on text segmentation; we termed it as complexity decomposition. It may be interpreted as representation of a text in terms of repeats. Initially, this approach was implemented for analyzing DNA by Gusev and co-authors (Gusev et al., 1989; 1991; 1993; 1999). Local sequence complexity is connected with molecular mechanisms of mutagenesis (Chuzhanova et al., 2003). 

An computer realization of this measure was suggested by (Babenko et al., 1999). The program  Complexity profile_builder is available at http://wwwmgs/mgs/programs/gc_net/.

We present LZcomposer  as new implementation with additional capabilities connected with long sequences (complete genomes), flexible profile construction, using different alphabets, analysis of several sequences and relative complexity estimations for group of sequences. 

In LowComplexity software suggested, we have realized some additional known estimates of complexity in order to compare different approaches. A user may choose the method of his interest or construct several complexity profiles simultaneously. In particular, we have realized the following evaluations of the text complexity: (1) by nucleotide frequency content (Wootton&Federhen, 1996); (2) by entropy of the given order of words (oligonucelotides); (3) by linguistic complexity (Troyanskaya et al., 2002). In the latter approach, linguistic complexity refers to combinatorial complexity denoted as the power of the l-gram dictionary under the fixed l. By summing up the values of combinatorial complexity over all values l, 1£l£N, where N is the sequence length, and dividing by maximal dictionary size we obtain the value of linguistic complexity. This measure was applied for studying the patterns composing nucleosomes and promoters (Gabrielian and Bolshoy, 1999; Troyanskaya et al., 2002).

Entropy-based evaluations of the text complexity (entropy of the l-th order) were introduced by Shannon (Shannon, 1948). Evaluations based on the variable memory Markov model may serve as a generalization (Orlov et al., 2002). For analysis of extended sequence complexity, a program is available that estimates stochastic complexity as the measure of data compression in the variable memory Markov model. An optimal variable memory Markov model is determined by a program in a form of the context source-tree as an optimal model for the text compression. Stochastic complexity is determined as sum of the size of compressed text and the complexity of model parameters. The method could be applied for analysis and comparison of extended sequence complexity (for example, complete bacterial genomes) and described in (Orlov et al., 2002).

By application of l-gram trees for the sequence representation in our software, we have resolved the computational problems stemming from a considerable length of the sequences processed. Our package is designed for analysis of an arbitrary symbolic sequence, including DNA and amino acid sequences.

The software is designed for effective search for the regions with low complexity in extended DNA sequences. The search is provided by different methods and its operation time is linearly dependent upon the sequence length. The program software enables to calculate average complexity profile for the sets of sequences in the FASTA format. By using the software, we have demonstrated the higher nucleotide sequence complexity of exons and lower complexity of introns (Orlov et al., 2004, in press). Also, we have found the alteration of the local text complexity for the splicing sites.

 Recent findings by the program in promoter complexitycomplete bacterial genomes complexity comparison

References

Abeysinghe,S.S., Chuzhanova,N., Krawczak,M., Ball,E.V., Cooper,D.N. (2003) Translocation and gross deletion breakpoints in human inherited disease and cancer I. Nucleotide composition and recombination-associated motifs. Hum Mutation 22(3), 229-244.

Alba,M.M., Laskowski,R.A. and Hancock,J.M. (2002) Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics, 18, 672-678.

Allison,L., Stern,L., Edgoose,T. and Dix,T.I. (2000) Sequence complexity for biological sequence analysis. Comput Chem., 24(1), 43-55.

Babenko,V.N., Kosarev,P.S., Vishnevsky,O.V., Levitsky,V.G., Basin,V.V. and Frolov,A.S. (1999) Investigating extended regulatory regions of genomic DNA sequences. Bioinformatics, 15, 644–653.

Chandra,S., Kapur,R., Chuzhanova,N., Summey,V., Prentice,D., Barker,J., Cooper,D., Williams,D.A. (2003) A rare complex DNA rearrangement in the murine Steel gene results in exon duplication and a lethal phenotype. Blood 102, 3548-3555.

Chen,X., Kwong,S. and Li,M.A. (1999) Compression algorithm for DNA sequences and its applications in genome comparison. Genome Inform. Ser. Workshop Genome Inform., 10, 51–61.

Chuzhanova,N., Abeysinghe,S.S., Krawczak,M., Cooper,D.N. (2003) Translocation and gross deletion breakpoints in human inherited disease and cancer II. Potential involvement of repetitive sequence elements in secondary structure formation between DNA ends. Hum Mutation 22(3), 245-251.

Chuzhanova,N.A., Anassis,E.J., Ball,E., Krawczak,M. and Cooper,D.N. (2003) Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum. Mutation, 21, 28-44.

Cox,R. and Mirkin,S.M. (1997) Characteristic enrichment of DNA repeats in different genomes. Proc. Natl. Acad. Sci. U S A, 94, 5237-5242.

Ebeling,W. and Jimenez-Montano,M. A. (1980). On grammars, complexity, and information measures of biological macromolecules. Math. Biosci. 52, 53-71.

Gabrielian,A. and Bolshoy,A. (1999) Sequence complexity and DNA curvature. Comput. Chem., 23, 263-74.

Gorban,A.N., Popova,T.G. and Sadovsky,M.G. (2000). Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. Open Sys.& Information Dyn. 7, 1-17.

Grosse,I., Herzel,H., Buldyrev,S.V. and Stanley,H.E. (2000). Species independence of mutual information in coding and noncoding DNA. Phys. Rev. E 61, 5624-5629.

Grumbach,S. and Tahi,F. (1994) A new challenge for compression algorithms: genetic sequences. J. Inf. Process. Manage 30, 875-886.

Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1989) Complexity analysis of genetic texts (with phage lambda as an example). Preprint no.20. Institute of Mathematics Acad. Sci. USSR, Siberian branch, Novosibirsk. (in Russian).

Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1991) Complexity analysis of genomes. I. Complexity and classification methods of detected structural regularities. Mol. Biol. (Mosk). 25, 825-834.

Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1993) The Lempel-Ziv complexity and local structure analysis of genomes. Biosystems, 30(1-3), 183-200. 

Gusev,V.D., Nemytikova,L.A. and Chuzhanova,N.A. (1999) On the complexity measures of genetic sequences. Bioinformatics, 15, 994–999.

Hancock, J.M. and Armstrong, J.S. (1994) SIMPLE34: an improved and enhanced implementation for VAX and SUN computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. CABIOS 10: 67-70.

Hancock,J.M. (2002) Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects. Genetica, 115, 93-103.

Jimenez-Montano,M.A., Ebeling,W., Pohl,T. and Rapp,P.E. (2002) Entropy and complexity of finite sequences as fluctuating quantities. Biosystems, 64, 23-32.

Kisliuk,O.S., Borovina,T.A. and Nazipova,N.N. (1999) Evaluation of genetic test redundancy using a high-frequency component of the l-gram graph. Biofizika (Mosk.), 44, 639–648. (in Russian).

Kolmogorov,A.N. (1965) Three approaches to definition of information quantity. Probl. Peredachi Inf. 1, 3–11. (in Russian).

Lempel,A. and Ziv,J. (1976) On the complexity of finite sequences. IEEE Trans. Inf. Theory, IT-22, 75–81.

Mantegna, R. N., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C.-K., Simons, M. and Stanley, H. E. (1994). Linguistic features of non-coding DNA sequences. Phys. Rev. Lett. 73, 3169-3172.

Milosavljevic,A. and Jurka,J. (1993) Discovering simple DNA sequences by the algorithmic significance method. Comput Appl Biosci. 9(4), 407-11. 

Nandi,T., Dash,D., Ghai,R., B.-Rao,C., Kannan,K., Brahmachari,S.K., Ramakrishnan,C. and Ramachandran,S. (2003) A novel complexity measure for comparative analysis of protein sequences from complete genomes. J. Biomol. Struct. Dyn., 20, 657-668.

Orlov,Yu.L., Filippov,V.P., Potapov,V.N. and Kolchanov,N.A. (2002) Construction of stochastic context trees for genetic texts. In Silico Biology, 2, 233-247, <http:www.bioinfo.de/isb/2002/02/0022/>.

Orlov Yu.L., Gusev V.D. and Miroshnichenko L.A. (2004) LZcomposer: decomposition of genomic sequences by repeat fragments. Biophysics (Mosk.), (in Press)

Rivals,E., Dauchet,M., Delahaye,J.P. and Delgrange,O. (1996) Compression and genetic sequence analysis. Biochimie, 78(5), 315-22. 

Sadovsky,M.G. (2003) The method to compare nucleotide sequences based on the minimum entropy principle. Bull Math Biol. 65(2), 309-322.

Schmitt,A.O. and Herzel,H. (1997) Estimating the entropy of DNA sequences. J. Theor. Biol. 188, 369-377.

Shannon, C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, pt.I., 379-423; pt.II., 623-656.

Stern,L., Allison,L., Coppel,R.L. and Dix,T.I. (2001) Discovering patterns in Plasmodium falciparum genomic DNA. Mol. Biochem. Parasitol., 118, 175-186.

Tautz,D., Trick,M. and Dover,G.A. (1986) Cryptic simplicity in DNA is a major source of genetic variation. Nature, 322, 652-656.

Troyanskaya,O.G., Arbell,O., Koren,Y., Landau,G.M. and Bolshoy,A. (2002) Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics, 18, 679-688.

Turner,C., Killoran,C., Thomas,N., Rosenberg,M., Chuzhanova,N., Johnston,J., Kemel,Y., Cooper,D., Biesecker,L. (2003) Human genetic disease caused by de novo mitochondrial-nuclear DNA transfer. Human Genetics 112: 303-309.

Varre,J.S., Delahaye,J.P. and Rivals,E. (1999) Transformation Distances: a family of dissimilarity measures based on movements of segments, Bioinformatics, 15(3), 194-202.

Wan,H., Li,L., Federhen,S. and Wootton,J.C. (2003) Discovering simple regions in biological sequences associated with scoring schemes. J. Comput. Biol., 10, 171-185.

Wan,H. and Wootton,J.C. (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem., 24, 71-94.

Weiss, O., Jimenez-Montano, M. A, and Herzel, H. (2000). Information content of protein sequences. J. Theor. Biol. 206, 379-386.

Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554-71.

 

 

To search low complexity regions use Low Complexity program

 

Related complexity analysis algorithms: 

 

Complexity by context tree source by Orlov&Potapov, Novosibirsk, Russia (Orlov et al., 2002)

SIMPLE by J.Hancock, Royal Holloway University of London, M.Alba, Universitat Pompeu Fabra, Spain (Alba et al., 2002)

Linguistic Complexity by G.Landau, A.Bolshoy, University of Haifa, Israel (Troyanskaya et al., 2002)

Transformation  Distance by J.Varre, Universite de Lille, France (Varre et al., 1999)

GenCompress, M.Li, S.Kwong, X.Chen, City University of HongKong, China (Chen et al., 1999) 

Old realization of DNA-oriented  Lempel-Ziv algorithm by (Babenko et al., 1999): Complexity profile_builder

Graphical presentation of sequence regularities: OligoRep system, Verbumculus, DeBruijn graphs

 

This resource has been developed in Institute of Cytology and Genetics, based on methods developed in Sobolev Institute of Mathematics, Novosibirsk, Russia 
Authors: Yu.L.Orlov, V.N.Potapov, V.D.Gusev, L.A.Miroshnichenko(Nemytikova)
The research was partially supported by RFBR, Russian Ministry of Education (E 02-6.0-250), and Siberian Branch of the Russian Academy of Sciences (Integration projects 65 and 119).