Example. Estimation of genetic text complexity

Construction of context tree

The program is designed to estimate how well symbolic data can be compressed.

The sequence to be analyzed should be entered into the text box in FASTA format.

Let us consider the task of searching for context dependencies in a set of AP-1 binding sites (32 kB; the sequences can be downloaded from the Eukaryotic Transcription Factor Binding Sites Compilation, http://wwwmgs.bionet.nsc.ru/mgs/dbases/nsamples/).

This set, in FASTA format, is entered into the input text box.
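To illustrate the input format, here is a minimal Python sketch of reading FASTA text into (header, sequence) pairs. It is not the program's own parser; the function name `parse_fasta` and the two example records are made up for illustration and are not taken from the AP-1 set. Text without a `>` header line is treated as a single unnamed record, on the assumption that this matches the "plain text" input option.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())  # sequence lines may be wrapped
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))
    return records


# Hypothetical example input (not real database records).
example = """>site_1 hypothetical record
tgactca
TGAGTCA
>site_2
ATGACTCAT
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq), seq)
```

Wrapped sequence lines are joined and upper-cased, so each record ends up as one continuous string.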

Leave all other parameters at their default values. (A more detailed description of the program is given on the Help page.)

Press the "Execute" button. The resulting picture in GIF format is shown at the bottom of this screen.

 

DNA sequences:

Standard alphabet {A,T,G,C}

2-lettered alphabets:

Weak/Strong          [AT][GC]

Purine/Pyrimidine    [AG][TC]

Amino acid sequences:

2-lettered alphabet (hydrophobic/hydrophilic)

3-lettered charge alphabet (base/neutral/acid)

3-lettered surface alphabet (outer/ambivalent/inner)

(For example, hydrophobic [AILMFPWV]=0, hydrophilic [RNDCQEGHKSTY]=1)

 

Text in user-defined alphabet

(Type groups of DNA or amino acid symbols in brackets, like [at][gc] or [AILMFPWV][RNDCQEGHKSTY]; input is not case-sensitive)

Legend for user-defined alphabet (by default, the digits 01234... are used in the output)

(Type one symbol for every group; for example, for [at][gc] type +- or WS)
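As a rough illustration of how a bracketed alphabet specification and its legend could be applied to a sequence, consider the following Python sketch. The names `parse_groups` and `recode` are hypothetical. Dropping symbols that belong to no group is an assumption, since the page does not say how the program treats them.

```python
import re


def parse_groups(spec):
    """Split a spec such as '[at][gc]' into upper-case symbol groups."""
    groups = [g.upper() for g in re.findall(r"\[([A-Za-z]+)\]", spec)]
    if not groups:
        raise ValueError("no [..] groups found in alphabet specification")
    return groups


def recode(sequence, spec, legend=None):
    """Rewrite a sequence in the reduced alphabet defined by spec."""
    groups = parse_groups(spec)
    if legend is None:
        # Default legend: digits 0, 1, 2, ... (works for up to 10 groups).
        legend = "".join(str(i) for i in range(len(groups)))
    if len(legend) != len(groups):
        raise ValueError("legend needs exactly one symbol per group")
    table = {sym: legend[i] for i, group in enumerate(groups) for sym in group}
    # Assumption: symbols outside every group are silently skipped.
    return "".join(table[s] for s in sequence.upper() if s in table)


print(recode("ATGCGA", "[at][gc]"))        # digits as the default legend
print(recode("atgcga", "[at][gc]", "WS"))  # user-defined legend
```

With the weak/strong grouping, `ATGCGA` becomes `001110` under the default legend and `WWSSSW` under the legend `WS`.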

 

 

Input sequences here (FASTA format or plain text)

from Screen (cut & paste)...

or from File:

Preceding context length (1<n<12)

Method of pseudocount calculation for absent contexts:

Default: +0.5 for each absent context

+1 count

Old variant

No pseudocounts
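The following Python sketch shows one possible reading of context counting with pseudocounts: every context of length n is tabulated together with the symbol that follows it, and continuations that were never observed receive the pseudocount instead of zero. The exact rule used by the program is not described on this page, so the function `context_counts` and its handling of absent contexts are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def context_counts(sequences, n, alphabet="ATGC", pseudo=0.5):
    """Count each symbol following every context of length n.

    Returns {context: {symbol: count}}. Continuations never observed get
    `pseudo` instead of 0 (an assumed reading of the "+0.5" option).
    """
    words = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(n, len(seq)):
            words[seq[i - n:i + 1]] += 1  # context plus the symbol after it
    table = {}
    for letters in product(alphabet, repeat=n):
        ctx = "".join(letters)
        table[ctx] = {}
        for sym in alphabet:
            seen = words[ctx + sym]  # Counter returns 0 for unseen words
            table[ctx][sym] = seen if seen > 0 else pseudo
    return table


demo = context_counts(["ATATGC"], n=1)
for ctx, row in demo.items():
    print(ctx, row)
```

Passing `pseudo=1` or `pseudo=0` would correspond, under the same assumption, to the "+1 count" and "No pseudocounts" options.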

Text output of the tree source (Optimized variable memory Markov model for VMM software)

or

 Graphic output

Tree types : Standard tree  or  Round tree 

Letters in image (uncheck if a small image leaves no room for letters)

Width of picture (in pixels, 100<x<2048)      Height of picture (in pixels, 100<y<1024)

                
Help
       Publications   

 

The results of the program execution will be displayed in a new HTML page as a figure:

 

If you uncheck the "Graphic output" parameter and press "Execute" again, the results of the program execution will be displayed in a new HTML page in text format, as given below.

The results contain the user-defined parameters for the program, the same source tree as in the picture (in text format), and all selected contexts (oligonucleotides) with their frequencies:

This output can be used for further prediction of DNA functional sites by the VMM program (just cut and paste the text into the corresponding window of model parameters).

Set context len:3
Set: Show all statistics
Set legend for alphabet:
Set model: Rissanen +0.5 pseudo count model
>AP_10001
-------------- Data information --------------
Alphabet       size:4
Alphabet   contains:ATGC
Context      lenght:3
Number of contextes:64
Complexity         :15976.343353
This is tree
in pseudographics (0-absent, 1-present)
/treebegin 4 ATGC
1
1111
1111000011111111
0000000000000000000000000000000000000000000011111111000011110000
(File size without comments) Length=8142+(  2)
Standard compressed file size=2035.5000
Complexity[0][0]=15976.3434
Direct calculation=16243.1322 (Compression = 0.9836)
Entropy=-SUM p*log(p) =1.9927 (*n=16224.3197)
Full context tree complexity (3 levels)=16049.4487 (Compression relative to full tree=0.9954)

+86.492+90.015+303.119+212.223+177.096+88.669+213.017+228.685+292.880+355.262+366.817+290.590+940.399+563.361+973.922+1082.446+1166.418+1243.845+1452.686+914.056+1272.908+3661.437
 Control complexity sum of leaves 15976.3434 == 15976.3434 

--Control complexity sum of leaves 0.0000 == 15976.3434 

 Context - Variability - Complexity
/contextbegin
GCAA= 24 
GCAT= 19 
GCAG= 56 
GCAC= 22 
----------GCA=129 	 24	 19	 56	 22	sum=121	( 86.4921)	( -1.2781)
GCTA= 24 
GCTT= 49 
GCTG= 60 
GCTC= 29 
----------GCT=161 	 24	 49	 60	 29	sum=162	( 90.0154)	( -1.3204)
GCGA= 22 
GCGT= 20 
GCGG= 59 
GCGC= 59 
. . .
----------CC=642 	149	161	122	206	sum=638	( 1272.9083)	( -1.3685)
--Length is 2.Leaves found:9
TA=287 
TT=474 
TG=616 
TC=491 
----------T=1861 	287	474	616	491	sum=1868	( 3661.4369)	( -1.3528)
--Length is 1.Leaves found:1
---Total number of leaves 22
---Simple empirics. Complexity of the tree relative to the full 3-tree is 0.085938
---- Context Length. Leaves found -------
--- 1	1
--- 2	9
--- 3	12
--- 4	0
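The text output above can be cross-checked with a few lines of Python. The sketch below reads the 0/1 rows of the /treebegin section as one row per tree depth, with each node's children occupying consecutive positions in the next row; this layout is an assumption inferred from the sample output, not a documented format. It also recomputes some of the reported summary figures from numbers copied out of the output. The function name `leaves_per_depth` is hypothetical.

```python
import math


def leaves_per_depth(rows, k):
    """Count leaves at each depth; rows[d] is the 0/1 string for depth d."""
    leaves = {}
    for depth, row in enumerate(rows):
        if len(row) != k ** depth:
            raise ValueError("row %d should have %d positions" % (depth, k ** depth))
        nxt = rows[depth + 1] if depth + 1 < len(rows) else None
        for i, bit in enumerate(row):
            if bit != "1":
                continue  # node absent from the tree
            has_child = nxt is not None and "1" in nxt[i * k:(i + 1) * k]
            if not has_child:
                leaves[depth] = leaves.get(depth, 0) + 1
    return leaves


# Rows transcribed from the /treebegin section (alphabet ATGC, so k = 4).
rows = [
    "1",
    "1111",
    "1111000011111111",
    "0" * 44 + "11111111" + "0000" + "1111" + "0000",
]
leaves = leaves_per_depth(rows, 4)

# Leaf complexities copied from the "+86.492+90.015+..." line.
leaf_terms = [86.492, 90.015, 303.119, 212.223, 177.096, 88.669, 213.017,
              228.685, 292.880, 355.262, 366.817, 290.590, 940.399, 563.361,
              973.922, 1082.446, 1166.418, 1243.845, 1452.686, 914.056,
              1272.908, 3661.437]
total = sum(leaf_terms)

# Continuation counts after context T (the TA, TT, TG, TC lines).
t_counts = [287, 474, 616, 491]
t_total = sum(t_counts)
t_sum_p_ln_p = sum(c / t_total * math.log(c / t_total) for c in t_counts)

ratio_direct = 15976.3434 / 16243.1322  # reported as Compression = 0.9836
ratio_full = 15976.3434 / 16049.4487    # reported as 0.9954
two_bit_bytes = 8142 * 2 / 8            # 2 bits per nucleotide, in bytes

print(leaves, sum(leaves.values()))
print(round(total, 3), round(t_sum_p_ln_p, 4))
print(round(ratio_direct, 4), round(ratio_full, 4), two_bit_bytes)
```

Under this reading the rows give 1, 9 and 12 leaves at context lengths 1, 2 and 3 (22 in total), in agreement with the "Context Length. Leaves found" table, and the 22 leaf terms add up to the reported complexity of about 15976.34. The reported relative complexity 0.085938 coincides with 22/256, although the page does not state how that figure is computed.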
 

The Institute of Cytology and Genetics (Russia)

This resource has been developed at the Institute of Cytology and Genetics, Novosibirsk, Russia
Authors: Yu.L.Orlov, V.P.Filippov, V.N.Potapov
Contributors: S.V.Lavryushev, D.A.Grigorovich
Leader: N.A.Kolchanov

The research was partially supported by the Russian Foundation for Basic Research (RFBR)