REGULATORY GENOMIC SEQUENCES, DATABASES
AND TOOLS FOR
ANALYSIS AND RECOGNITION.

A. E. Kel, N. A. Kolchanov, O. V. Kel, Yu. Kondrakhin, F. A. Kolpakov, S. V. Lavryushov, M. P. Ponomarenko, E. Wingender*)

Institute of Cytology and Genetics, Novosibirsk, Russia, E-mail: kel@bionet.nsc.ru

*)Gesellschaft für Biotechnologische Forschung mbH, Mascheroder Weg 1, D-38124 Braunschweig, Germany, E-mail: ewi@gbf-braunschweig.de

In the past decade, data on the molecular mechanisms regulating eukaryotic gene expression have grown rapidly. At the transcriptional level, gene expression is regulated mainly by sequence-specific interactions of transcription factors with their target sites (cis-elements) located in the transcription regulatory regions of genes.

At present, information on regulatory sequences in eukaryotic genomes is accumulating rapidly in many specialized databases (EPD, TFD, TRANSFAC [1], TRRD [2], COMPEL [3]) and in the sequence databases EMBL and GenBank.

A serious drawback in working with these databases is that they are poorly linked to one another.

To support comprehensive research on the mechanisms controlling eukaryotic gene expression at the transcriptional level, we have developed two databases: TRRD (Transcription Regulatory Region Database), which accumulates data on the structure-function organization of gene regulatory regions, and COMPEL, a database of composite regulatory elements, which contain contiguous or overlapping binding sites for different transcription factors from different regulatory pathways. In these databases we collect data on various features of gene expression regulation, gene classifications, the structure of gene regulatory regions, cis-elements, composite elements, promoters, and enhancers. Links between TRRD, COMPEL and TRANSFAC were recently set up.
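The definition of a composite element given above (contiguous or overlapping binding sites for different transcription factors) can be illustrated with a minimal sketch. It is not the COMPEL implementation: the `Site` record, the function `candidate_composite_elements`, the `max_gap` parameter and the half-open coordinate convention are all assumptions introduced for illustration, and the test of whether the factors belong to different regulatory pathways is left out.

```python
from itertools import combinations
from typing import NamedTuple


class Site(NamedTuple):
    """Hypothetical record for one annotated binding site."""
    factor: str  # name of the transcription factor
    start: int   # 0-based start, inclusive
    end: int     # end, exclusive


def candidate_composite_elements(sites, max_gap=0):
    """Return pairs of sites for different factors that overlap or are contiguous.

    The gap between two half-open intervals is negative when they overlap and
    zero when they are contiguous; max_gap (an assumed parameter) allows a
    small spacer between the two sites.
    """
    pairs = []
    ordered = sorted(sites, key=lambda s: s.start)
    for a, b in combinations(ordered, 2):
        if a.factor == b.factor:
            continue  # a composite element needs two different factors
        gap = max(a.start, b.start) - min(a.end, b.end)
        if gap <= max_gap:
            pairs.append((a, b))
    return pairs


# Illustrative, made-up annotations (not taken from any database).
example_sites = [
    Site("TF_A", 10, 17),
    Site("TF_B", 17, 26),  # contiguous with TF_A
    Site("TF_C", 60, 70),  # far from both
]
example_pairs = candidate_composite_elements(example_sites)
```

With the made-up annotations above, only the TF_A and TF_B sites are reported as a candidate pair, because they are contiguous and bind different factors.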

We have developed the FUNSYTE [4] computer toolbox for the analysis and recognition of regulatory genomic sequences. The toolbox contains software running under DOS and Windows. It provides:
(i) access to the databases on transcriptional regulation of eukaryotic genes, TRANSFAC, TRRD and COMPEL (in the relational model), and to the sequence databases;
(ii) extraction of information from the databases, anchoring of the sequences in the databases, and preparation of samples of regulatory genomic sequences;
(iii) analysis of regulatory genomic sequences with the software.

The toolbox has:
(i) software for the analysis of the inner structure of regulatory genomic sequences (oligonucleotide context features; information measures; correlation of base-in-position frequencies; local site alignment; and calculation of DNA conformational parameters);
(ii) software for the development of recognition methods for regulatory genomic sequences (constructing consensuses, recognition groups of homologues, and nucleotide and oligonucleotide weight matrices; subsampling by clustering procedures; and developing recognition methods by means of pattern recognition methods: perceptron, Fisher discriminant, and SITE VIDEO [5]);
(iii) software for applying these methods to the identification of regulatory sequences in newly discovered nucleotide sequences from eukaryotic genomes (search for potential binding sites for transcription factors; search for potential composite elements; search for potential MAR sites and nucleosome binding sites; search for potential promoter sequences; and calculation of transcription regulation potential).
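Two of the steps listed above, constructing a nucleotide weight matrix and searching a new sequence for potential binding sites, can be sketched as follows. This is a generic log-odds weight matrix under stated assumptions, not the method implemented in the toolbox: the function names `build_weight_matrix` and `scan`, the pseudocount, the uniform background frequency of 0.25 and the score threshold are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned sites of equal length.

    Each column maps a base to log2(observed frequency / background).
    The pseudocount (assumed value) keeps unseen bases from scoring -infinity.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in BASES}
        )
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Made-up training sites and query sequence, for illustration only.
training_sites = ["TGACTCA", "TGAGTCA", "TGACTCA"]
weight_matrix = build_weight_matrix(training_sites)
potential_sites = scan("AAATGACTCAGG", weight_matrix, threshold=5.0)
```

In practice the threshold would be calibrated on samples of known sites and background sequences; the value used here is arbitrary.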

1. Knueppel, R., Dietze, P., Lehnberg, W., Frech, K. and Wingender, E. (1994). J. Comput. Biol., 1, 191-198.
2. Kel, O. V., Romachenko, A. G., Kel, A. E., Naumochkin, A. N. and Kolchanov, N. A. (1995). Proceedings of the 28th Annual Hawaii International Conference on System Sciences [HICSS], v. 5, Biotechnology Computing, IEEE Computer Society Press, Los Alamitos, California, p. 42-51.