LZComposer program: Orlov's TUTORIAL

Complexity decomposition of genetic texts by Lempel-Ziv method.

HELP. How to use program parameters.

This page describes the parameters of the input page.

The program constructs a complexity profile based on symbolic data compression by a modified Lempel-Ziv algorithm.
The program can also make a full complexity decomposition of a sequence into non-overlapping fragments, with detailed text output.

Access to the program: LZcomposer

List of biological tasks that can be solved by using the program to estimate complexity based on the Lempel-Ziv algorithm:

Search for low complexity regions in nucleotide sequences;
Search for exact repeats of all types: direct, symmetric, direct complementary and inverted complementary;
Comparison of sequences: search for maximal common fragments.

Review of other methods of complexity analysis

Data input:

The sequence to be analyzed should be entered into the text box in FASTA format (arrow number 1). Separate sequences are delimited by a line whose first symbol is '>'. A sequence can also be loaded from a file on the user's computer by selecting the option 'From file' (number 2).

If there is only a single sequence to be analyzed, you may enter it in plain text format without the comment line. Sequence length is limited to 240 Mb (up to the size of the largest human chromosome). For large sequences, please use the 'From file' upload.
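
The input rules above can be illustrated with a minimal Python sketch. It is not the program's own parser; the function name parse_fasta and the choice to return an empty header for plain text input are assumptions made for illustration.

```python
def parse_fasta(text):
    """Split FASTA-like text into (header, sequence) pairs.

    A line starting with '>' opens a new record. Text without any
    '>' line is treated as one sequence with an empty header.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # close the previous record, if any
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))
    return records
```

For example, parse_fasta(">s1\nACGT\nAC\n>s2\nGG") gives two records, while parse_fasta("ACGT\nTT") gives a single record with an empty header.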

Complexity decomposition is recommended for sequences no longer than 5-10 Mb because of operating memory restrictions. Longer sequences (>12 Mb) can be analyzed by the profile method (sliding window) only. The size of the sliding window is limited to 10 Mb.

Program options:

First, choose the alphabet. The default alphabet (A, T, G and C; number 1 in the figure) is recommended for DNA sequence analysis.
Other combinations are possible, for example {A/T, G/C} for studying GC content in a binary alphabet (numbers 2 and 3).
A user may specify another combination of letters by choosing the option 'User defined alphabet' and entering the alphabet in the text box (number 7), for example in the form [AT][GC], [TC][AG], or A[CGT]. The symbols inside square brackets are treated as a single symbol. Symbols that are not listed in the alphabet are ignored.

Predefined DNA alphabets: Weak-Strong DNA alphabet: W=A/T, S=G/C; Purine-Pyrimidine DNA alphabet: R=A/G, Y=T/C; Amino-Keto DNA alphabet: M=A/C, K=G/T.

A DNA sequence is converted to the user-selected alphabet automatically, without any prompt. The program uses a numeric representation of symbols for calculation (0, 1, 2, ...), so the alphabet matters only for data presentation.
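
The conversion of a sequence to a user-defined alphabet can be sketched as follows. This is an illustration only, assuming that each bracketed group or single letter receives the next integer code in order of appearance; the names parse_alphabet and encode are hypothetical and do not come from the program.

```python
import re

def parse_alphabet(spec):
    """Map each letter of a spec such as '[AT][GC]' or 'A[CGT]' to a digit.

    Letters inside one pair of square brackets share the same code.
    """
    # each match is either a bracketed group or a single letter
    groups = re.findall(r"\[([^\]]+)\]|([^\[\]\s])", spec)
    mapping = {}
    for code, (bracketed, single) in enumerate(groups):
        for letter in (bracketed or single).upper():
            mapping[letter] = code
    return mapping

def encode(sequence, mapping):
    """Convert a sequence to digits; symbols not in the alphabet are ignored."""
    return [mapping[ch] for ch in sequence.upper() if ch in mapping]
```

With the alphabet [AT][GC], the sequence ATGNCA is encoded as 0, 0, 1, 1, 0, because N is not listed and is skipped.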

To analyze amino acid sequences, choose the protein alphabet (ACDEFGHIKLMNPQRSTVWY). In this case, the nucleotide alphabet parameters are ignored.
We suggest the following variants of grouping amino acid residues:

by hydrophobicity-hydrophilicity, i.e. [AILMFPWV] - hydrophobic and [RNDCQEGHKSTY] - hydrophilic (number 4 in the figure)

by charge (number 5 in the figure), [RHK] - basic, [ANCQGILMFPSTWYV] - neutral, [DE] - acidic

by surface location (inner location in protein globule or surface location) [RNDQEHK] - outer, [ACGPSTWY] - ambivalent, and [ILMFV] - inner.

A user may define a custom partitioning by entering, in the appropriate text box, a line indicating how to group the symbols (number 7). When a user-defined alphabet is used, residues that are not listed are ignored.
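
Because unlisted residues are ignored, it can be useful to check that a grouping covers all 20 residues before submitting it. The sketch below is illustrative: the names GROUPINGS, uncovered and to_spec are hypothetical, and it assumes that protein groupings are written in the same square-bracket form as the user-defined DNA alphabet.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# the three groupings suggested above
GROUPINGS = {
    "hydrophobicity": ["AILMFPWV", "RNDCQEGHKSTY"],
    "charge": ["RHK", "ANCQGILMFPSTWYV", "DE"],
    "surface": ["RNDQEHK", "ACGPSTWY", "ILMFV"],
}

def uncovered(groups, alphabet=AMINO_ACIDS):
    """Return the residues that appear in no group (they would be ignored)."""
    covered = set("".join(groups))
    return sorted(set(alphabet) - covered)

def to_spec(groups):
    """Write the groups in square-bracket form, e.g. [RHK][...][DE]."""
    return "".join(f"[{group}]" for group in groups)
```

Each of the three suggested groupings covers all 20 residues, so uncovered returns an empty list for them.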

The user can define a custom complementarity function (number 8). This means that each letter of the user-defined alphabet is assigned a corresponding letter to be used in the compression algorithm. For example, the standard complementarity order for {A,T,G,C} is {T,A,C,G}. The order of letters is important.
Using complementary search parameters with a non-DNA alphabet is not recommended.
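
The role of letter order in the complementarity function can be shown with a short Python sketch. It assumes that the i-th letter of the alphabet is paired with the i-th letter of the complement list; the names complement_map and apply_complement are hypothetical.

```python
def complement_map(alphabet, complements):
    """Pair letters by position: alphabet[i] corresponds to complements[i]."""
    if len(alphabet) != len(complements):
        raise ValueError("alphabet and complement list must have equal length")
    return dict(zip(alphabet, complements))

def apply_complement(sequence, cmap):
    """Replace every letter of the sequence by its corresponding letter."""
    return "".join(cmap[ch] for ch in sequence)
```

With the standard order, complement_map("ATGC", "TACG") maps A to T, T to A, G to C and C to G, so AGCTTA becomes TCGAAT. Changing the order of either list changes the pairing.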

Only a single type of the alphabet should be chosen (see numbers 1-7).

The parameters of the calculation method are ordered as indicated by the arrows in the figure below:

(1) The user should select the calculation method for complexity decomposition. The first variant is calculation of the complexity profile of a sequence in a sliding window (number 1). The default sliding window size is 50 nt, and the default profile step is 50 nt. The result is an array of numbers (currently in text format).
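
The idea of a sliding-window profile can be sketched in Python. The program uses a modified Lempel-Ziv algorithm that is not specified on this page, so the sketch substitutes a simple greedy parsing with direct repeats only: each fragment is the longest string already present in the preceding text of the window, or a single new symbol. The names lz_complexity and complexity_profile, and the 1-based window positions, are assumptions.

```python
def lz_complexity(seq):
    """Count fragments of a greedy parsing (direct repeats only)."""
    count, i = 0, 0
    while i < len(seq):
        length = 0
        # extend the fragment while it still occurs in the processed prefix
        while i + length < len(seq) and seq[i:i + length + 1] in seq[:i]:
            length += 1
        i += max(length, 1)  # a new symbol forms a fragment of length 1
        count += 1
    return count

def complexity_profile(seq, window=50, step=50):
    """Return (1-based window start, complexity) for each full window."""
    return [
        (start + 1, lz_complexity(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

For the toy sequence acgtacgtaaaa with window=4 and step=4, the last window (aaaa) receives a lower value than the first two, which is how a low complexity region shows up in the profile.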

(2) The second variant (number 2 in the figure) is decomposition of the whole sequence (or of the joined sequences in the sample). The result is presented as a set of non-overlapping sequence fragments with supplementary information about their localization and repeat types. The resulting decomposition output is a large text (approximately four times larger than the sequence under analysis).
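
A whole-sequence decomposition into non-overlapping fragments can be sketched as below. This is a simplified illustration, not the program's modified algorithm: only direct repeats are searched, the prototype must lie entirely in the already processed text, and the record layout (start, length, kind, fragment) and the name decompose are assumptions.

```python
def decompose(seq):
    """Greedy decomposition into (start, length, kind, fragment) records.

    kind is 'D' for a direct repeat of earlier text and 'NW' for a
    new symbol; start positions are 1-based.
    """
    records = []
    i = 0
    while i < len(seq):
        length = 0
        # longest fragment starting at i that occurs in seq[:i]
        while i + length < len(seq) and seq[i:i + length + 1] in seq[:i]:
            length += 1
        kind = "D" if length else "NW"
        length = max(length, 1)
        records.append((i + 1, length, kind, seq[i:i + length]))
        i += length
    return records
```

The fragments never overlap and their concatenation restores the input, so the number of records is the complexity value of the sequence under these assumptions.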

(3) The third variant presents full complexity decomposition of a set of sequences in FASTA-format (number 3).

(4) The fourth variant corresponds to complexity decomposition of each sequence in the set by the other sequences (number 4). This means that repeat prototypes are taken from the other sequences, not from the sequence itself as in the methods described above. The result is a table of mutual relatedness of the sequences in the set, measured as the edit distance that corresponds to the complexity decomposition in this case.
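
Decomposition of one sequence by another can be sketched in the same greedy style. The sketch assumes direct repeats only and a pairwise table; how the program combines several source sequences and what it reports on the diagonal are not stated here, so the names cross_complexity and relatedness_table and the pairwise layout are illustrative assumptions.

```python
def cross_complexity(target, source):
    """Count fragments of target when prototypes are taken from source only."""
    count, i = 0, 0
    while i < len(target):
        length = 0
        # longest fragment of target starting at i that occurs in source
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        i += max(length, 1)  # a symbol absent from source is its own fragment
        count += 1
    return count

def relatedness_table(seqs):
    """Return {(a, b): complexity of seqs[a] decomposed by seqs[b]} for a != b."""
    n = len(seqs)
    return {
        (a, b): cross_complexity(seqs[a], seqs[b])
        for a in range(n)
        for b in range(n)
        if a != b
    }
```

Closely related sequences give small values (an identical copy is covered by one fragment), while unrelated sequences need many short fragments.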

(5) The fifth variant presents complexity decomposition based on all theoretically possible variants of the correspondence (complementarity) function (number 5). This method of complexity analysis can be used to compare different variants of data compression.

(6) Analysis of one sequence by another requires the second sequence to be entered in a separate window (number 6). This variant of complexity decomposition is intended for analysis of evolutionary relationships between sequences. Two different sets of sequences in FASTA format can also be used, to decompose the first set by the second.

In the latter case, the user should use both input windows, for the first (number 1) and second (number 2) sequences, in the program interface (see below). Both "cut and paste" and file upload are available. The second input window is not needed for complexity profile calculation, whole sequence decomposition and its variants (methods 1-5 described above).

Repeat types, output parameters and program execution

The user can select any combination of the four repeat types for complexity decomposition (number 1), but at least one repeat type must be selected.
Classification of repeats in DNA sequences (short examples):
1) Direct (type D) AGCTTA...AGCTTA
2) Symmetric (type S) AGCTTA...ATTCGA
3) Inverted (symmetric complementary) (type I) AGCTTA...TAAGCT
4) Direct Complementary (type C) AGCTTA...TCGAAT
It is recommended to use direct and inverted repeats due to their biological significance.
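
The four repeat types differ only in how the second copy is derived from the first: reversed or not, complemented or not. A minimal Python sketch, assuming the standard DNA complementarity (A-T, G-C); the names COMPLEMENT and repeat_variants are illustrative.

```python
# standard DNA complementarity: A<->T, G<->C
COMPLEMENT = str.maketrans("ATGC", "TACG")

def repeat_variants(fragment):
    """Return the second copy of a repeat for each of the four types."""
    direct = fragment
    symmetric = fragment[::-1]  # reversed order of letters
    return {
        "D": direct,                           # direct
        "S": symmetric,                        # symmetric
        "I": symmetric.translate(COMPLEMENT),  # inverted (reverse complement)
        "C": direct.translate(COMPLEMENT),     # direct complementary
    }
```

For the fragment AGCTTA this gives AGCTTA (D), ATTCGA (S), TAAGCT (I) and TCGAAT (C).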

Calculation based on automatic choice of correspondence (complementarity) function is also available (number 2). All selected repeat types will be ignored in this case.

Parameters for the program output are indicated by numbers 3-7 in the figure. The user can select minimal output (number 3), which reports only the complexity value (an integer, the minimal number of non-overlapping fragments). Full detailed text output of all fragments in the decomposition is also available (number 4). The full detailed report is not recommended for long sequences (longer than 1 Mb, such as complete microbial genomes). The other variants (numbers 5-7) correspond to restricted text output and statistics of the repeat distribution (for whole text decomposition only). Statistical parameters include the distribution of long repeats in the decomposition, the types of such repeats, the distances between them, and so on.

The "Reset form" button resets all checked parameters to their default values (number 8).
The program is executed by clicking the 'Execute' button (see number 9 in the figure above).

Data output:

The program outputs the results in text format (for the profile method, the position and the complexity value of each sliding window).

A short example of the data output (complexity profile for a 20 bp window) looks like a column of numbers:

...

Another example (whole file decomposition, full detailed report) lists, step by step, all fragments of the complexity decomposition of a sequence (a promoter).
The output looks like:

Compose whole file:
1 ([1:1],1,NW,'a')
2 ([1:1],1,I<,'t')
3 ([1:1],1,NW,'g')
4 ([1:3],1,D>,'g')
5 ([1:2],1,I<,'a')
6 ([1:3],1,D>,'g')
7 ([1:3],3,I<,'tcc')
10 ([1:2],1,D>,'t')
11 ([1:2],2,D>,'tg')
13 ([1:4],3,I<,'ctc')

...
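
Lines of the detailed report can be read by a script. The sketch below parses the line layout shown in the example; the meaning of the fields is not documented on this page, so the field names (start, ref, length, type, direction, fragment) and the function name parse_fragment are assumptions inferred from the example.

```python
import re

# layout inferred from the example: 7 ([1:3],3,I<,'tcc')
LINE = re.compile(
    r"^(\d+)\s+\(\[(\d+):(\d+)\],(\d+),([A-Z]+)([<>]?),'([^']*)'\)$"
)

def parse_fragment(line):
    """Parse one report line into a dict, or return None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    start, ref_a, ref_b, length, kind, arrow, text = match.groups()
    return {
        "start": int(start),               # first number on the line
        "ref": (int(ref_a), int(ref_b)),   # bracketed pair, meaning assumed unknown
        "length": int(length),
        "type": kind,                      # e.g. NW, D, I
        "direction": arrow,                # '<', '>' or empty
        "fragment": text,
    }
```

In the example above, the first number of each line plus the length equals the first number of the next line, and the length equals the number of letters in the quoted fragment; both properties can be used as a sanity check after parsing.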

References to algorithm description:

Publications

and implementations: promoter complexity, complete bacterial genomes complexity comparison

Comments and questions are welcome; please contact Yu. Orlov.