**The search for DNA regions with low complexity is one of
pivotal tasks of modern structural analysis of complete genomes. The low
complexity may be preconditioned by strong inequality in nucleotide content
(biased composition), by tandem or dispersed repeats, by palindrome-hairpin
structures, as well as by combination of all these factors. Numerical measures
of text complexity including combinatorial and linguistic ones, together with
complexity estimation by modified Lempel-Ziv algorithm are reviewed. Methods
were implemented in a software tools LZcomposer
and LowComplexity.
**

Intuitively, complexity of symbolic sequence reflects an
ability to represent a sequence in a compact form based on some structural
features of this sequence. To evaluate text complexity, several groups of
methods were developed: entropy measures (Shannon, 1948), with the simplest of
them using only the alphabet symbol frequencies (Wootton&Federhen, 1996);
method of clusterization of cryptically simple sequences (Hancock&Armstrong,
1994; Alba et al., 2002);
evaluation of the alphabet capacity *l*-gram (combinatorial complexity and
linguistic complexity) (Milosavljevic&Jurka,1993; Kisliuk et al., 1999; Gabrielian&Bolshoy, 1999;
Troyanskaya et al., 2002); modifications of complexity measure by Lempel and Ziv
(Gusev et al., 1991; 1999; Chen et al., 1999); stochastic complexity (Orlov et
al., 2002); and grammatical complexity (Jimenez-Montano et al., 2002).

The general approach to estimating complexity of symbolic sequences (texts) was suggested by A.N. Kolmogorov (Kolmogorov, 1965). He proved that there exists an optimal algorithm or a program for the text generation. Kolmogorov complexity is the length of the shortest code generating given sequence. Kolmogorov complexity is not recursive function (is not realized in computational scheme). However, for the sequence of finite length, various constructive realizations of non-optimal coding were developed (Lempel and Ziv, 1976; Ebeling and Jimenes-Montano, 1980; Gusev et al., 1991; 1999; Babenko et al., 1999; Chen et al. 1999). Approaches to genetic sequence analysis based on compression algorithms, has been suggested also by Grumbach and Tahi, Milosavljevic and Rivals (Rivals et al., 1996). At whole information content of genetic sequences was intensively studied (Allison et el., 2000).

There have been a considerable number of theoretical studies on information measures and complexity for genetic macromolecules (Ebeling and Jimenez-Montano, 1980). Compression algorithms applied to DNA sequences were discussed by Grumbach and Tahi (Grumbach and Tahi, 1994). Entropies of DNA sequences have been discussed in (Mantegna et al., 1994; Schmitt and Herzel, 1997). The mutual information, a measure intimately related to entropies, has been successfully used to predict protein coding regions in DNA (Grosse et al., 2000). A more systematic approach is based on the reconstruction of the word frequencies using frequencies of other, shorter sub-words (Gorban et al., 2000). A method for sequences comparison based on the specific entropy of a frequency dictionary was suggested by M.Sadovsky (Sadovsky, 2003). Calculations made on several genomes detected no significant differences of reconstructed frequencies from real ones. The complexity of large sets of non-redundant protein sequences was measured in (Weiss et al., 2000); there it was shown that proteins are fairly close to random sequences.

As the method for complexity evaluation, we have chosen the scheme of the text representation in terms of repeats, which uses the concept of complexity of a finite symbolic sequence, introduced by Lempel and Ziv (Lempel and Ziv, 1976). The approach of Lempel and Ziv is oriented to development of efficient algorithm for data compression. While studying complexity, we are interested not in a mere compression of genetic texts, but rather in detection of the regularities underlying it. The Lempel–Ziv complexity measure is based on text segmentation; we termed it as complexity decomposition. It may be interpreted as representation of a text in terms of repeats. Initially, this approach was implemented for analyzing DNA by Gusev and co-authors (Gusev et al., 1989; 1991; 1993; 1999). Local sequence complexity is connected with molecular mechanisms of mutagenesis (Chuzhanova et al., 2003).

An computer realization of this measure was suggested
by (Babenko et al., 1999). The program Complexity
profile_builder
is available at **http://wwwmgs/mgs/programs/gc_net/.**

We present **LZcomposer **
as new implementation with additional capabilities connected with long sequences
(complete genomes), flexible profile construction, using different alphabets,
analysis of several sequences and relative complexity estimations for group of
sequences.

In **LowComplexity
**software suggested, we have realized some additional
known estimates of complexity in order to compare different approaches. A user
may choose the method of his interest or construct several complexity profiles
simultaneously. In particular, we have realized the following evaluations of the
text complexity: (1) by nucleotide frequency content (Wootton&Federhen,
1996); (2) by entropy of the given order of words (oligonucelotides); (3) by
linguistic complexity (Troyanskaya et al., 2002). In the latter approach,
linguistic complexity refers to combinatorial complexity denoted as the power of
the *l*-gram dictionary under the fixed *l.* By summing up the values
of combinatorial complexity over all values *l*, 1£*l*£*N*,
where *N* is the sequence length, and dividing by maximal dictionary size
we obtain the value of linguistic complexity. This measure was applied for
studying the patterns composing nucleosomes and promoters (Gabrielian and
Bolshoy, 1999; Troyanskaya et al., 2002).

Entropy-based evaluations of the text complexity (entropy
of the *l*-th order) were introduced by Shannon (Shannon, 1948).
Evaluations based on the variable memory Markov model may serve as a
generalization (Orlov et al., 2002). For analysis of extended sequence
complexity, a program is available that estimates stochastic complexity as the
measure of data compression in the variable memory Markov model.
An optimal variable memory Markov model is determined by a program in a form of
the context source-tree as an optimal model for the text compression. Stochastic
complexity is determined as sum of the size of compressed text and the
complexity of model parameters. The method could be applied for analysis and
comparison of extended sequence complexity (for example, complete bacterial
genomes) and described in (Orlov et al., 2002).

By application of *l*-gram trees for the sequence
representation in our software, we have resolved the computational problems
stemming from a considerable length of the sequences processed. Our package is
designed for analysis of an arbitrary symbolic sequence, including DNA and amino
acid sequences.

The software is designed for effective search for the regions with low complexity in extended DNA sequences. The search is provided by different methods and its operation time is linearly dependent upon the sequence length. The program software enables to calculate average complexity profile for the sets of sequences in the FASTA format. By using the software, we have demonstrated the higher nucleotide sequence complexity of exons and lower complexity of introns (Orlov et al., 2004, in press). Also, we have found the alteration of the local text complexity for the splicing sites.

**Recent
findings by the program in promoter
complexity, complete
bacterial genomes complexity comparison**

**References
**

Ebeling,W. and Jimenez-Montano,M.
A. (1980). On grammars, complexity, and information measures of biological
macromolecules. *Math. Biosci*. 52, 53-71.

Gorban,A.N., Popova,T.G. and
Sadovsky,M.G. (2000). Classification of symbol sequences over their frequency
dictionaries: towards the connection between structure and natural taxonomy. *Open
Sys.& Information Dyn.* 7, 1-17.

Grumbach,S. and Tahi,F. (1994) A new challenge
for compression algorithms: genetic sequences. *J. Inf. Process. Manage*
30, 875-886.

Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1989) Complexity analysis of genetic texts (with phage lambda as an example). Preprint no.20. Institute of Mathematics Acad. Sci. USSR, Siberian branch, Novosibirsk. (in Russian).

Kolmogorov,A.N. (1965) Three approaches to definition of
information quantity. *Probl. Peredachi Inf.* **1**, 3–11. (in
Russian).

Lempel,A. and Ziv,J. (1976) On the complexity of finite
sequences. *IEEE Trans. Inf. Theory*, **IT-22**, 75–81.

Orlov,Yu.L.,
Filippov,V.P., Potapov,V.N. and Kolchanov,N.A.
(2002) Construction of stochastic context trees for genetic texts. *In Silico
Biology*, **2**, 233-247, <http:www.bioinfo.de/isb/2002/02/0022/>.

Orlov Yu.L., Gusev V.D. and Miroshnichenko L.A. (2004)
LZcomposer: decomposition of genomic sequences by repeat fragments. *Biophysics
(Mosk.),* (in Press)

Shannon, C.E. (1948) A mathematical theory of
communication. *Bell Syst. Tech. J.*, **27**, pt.I., 379-423; pt.II.,
623-656.

**
To search low complexity regions use Low Complexity
program
**

Related complexity analysis algorithms:

Complexity by context tree source by Orlov&Potapov, Novosibirsk, Russia (Orlov et al., 2002)

SIMPLE by J.Hancock, Royal Holloway University of London, M.Alba, Universitat Pompeu Fabra, Spain (Alba et al., 2002)

Linguistic Complexity by G.Landau, A.Bolshoy, University of Haifa, Israel (Troyanskaya et al., 2002)

Transformation Distance by J.Varre, Universite de Lille, France (Varre et al., 1999)

GenCompress, M.Li, S.Kwong, X.Chen, City University of HongKong, China (Chen et al., 1999)

Old realization of DNA-oriented Lempel-Ziv algorithm by (Babenko et al., 1999): Complexity profile_builder

Graphical presentation of sequence regularities: OligoRep system, Verbumculus, DeBruijn graphs

**This resource has been developed in Institute of Cytology
and Genetics, based on methods developed in Sobolev Institute of Mathematics,
Novosibirsk, Russia
Authors: Yu.L.Orlov,
V.N.Potapov, V.D.Gusev,
L.A.Miroshnichenko(Nemytikova)
The research was partially supported by RFBR,
Russian Ministry of
Education (E 02-6.0-250), and Siberian Branch of
the Russian Academy of Sciences (Integration projects 65 and 119).**