RecGroup program

Aim:

This program constructs Recognition Groups, which <RGSiteScan> then uses to obtain the final result: the identification of potential binding sites in a DNA sequence. RecGroup is thus the first program in a two-stage process for revealing transcription factor binding sites.

Comments:

A Recognition Group (RG) is a set of highly homologous oligonucleotides of equal length. A Recognition Group retains most of the information and is therefore well suited to describing transcription factor binding sites.

A Recognition Group is built from a sample of non-aligned DNA sequences of various lengths that are known a priori to contain the binding sites. No information about the location or strand orientation of the sites within these sample sequences is required. This is an advantage of the algorithm over standard approaches such as the consensus method and nucleotide frequency matrices.
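When neither the position nor the strand of a site is known, every window of a given length on both strands is a potential candidate. The following is a minimal sketch of that idea only; it is not the RecGroup algorithm, and the function name `candidate_oligos` is a hypothetical helper introduced for illustration.

```python
# Illustrative sketch, not RecGroup code: the names here are hypothetical.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def candidate_oligos(seq: str, length: int) -> set[str]:
    """Return all distinct windows of `length` taken from both strands of `seq`."""
    forward = seq.upper()
    # The opposite strand, read 5' to 3', is the reverse complement.
    reverse = forward.translate(_COMPLEMENT)[::-1]
    found = set()
    for strand in (forward, reverse):
        for start in range(len(strand) - length + 1):
            found.add(strand[start:start + length])
    return found


# A fragment containing ATTGGCT also yields AGCCAAT from the opposite strand.
oligos = candidate_oligos("gtgATTGGCTcag", 7)
print("ATTGGCT" in oligos, "AGCCAAT" in oligos)
```

Collecting windows from both strands means a site is found regardless of the orientation in which the sample sequence was recorded.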

Each oligonucleotide in an RG is assigned a weight that reflects the proportion of sample sequences containing this oligonucleotide in either orientation.
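A minimal sketch of how such a weight could be computed, assuming the weight is a simple count of the sample sequences that contain the oligonucleotide or its reverse complement. The function names `reverse_complement` and `oligo_weight` are hypothetical and are not part of RecGroup; the three sequences are taken from the NF-Y expert set in the example.

```python
# Illustrative sketch, not RecGroup code: the names here are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def oligo_weight(oligo: str, sequences: list[str]) -> int:
    """Count the sequences that contain `oligo` on either strand (case-insensitive)."""
    forward = oligo.upper()
    reverse = reverse_complement(forward)
    return sum(1 for s in sequences if forward in s.upper() or reverse in s.upper())


sample = [
    "agttttactgggtagagcaagcacaaaccAGCCAATgagtaactgctccaa",  # MMAGL1
    "cagcctcatcgagctcggtgATTGGCTcagaagggaaaaggcgggtctccg",  # HSHSP70D
    "tggggctgtccggactgaccctcgccctgtccctgctcgtgtgcctgggtg",  # HSNPY
]
print(oligo_weight("AGCCAAT", sample))
```

Dividing the count by the number of sample sequences would give the weight as a proportion instead.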

The site length l (i.e. the oligonucleotide length) is an estimated parameter, as is the degree of homology T. One of the results provided by this program is therefore a precisely defined site length. Both parameters are estimated by an original two-parameter optimization procedure adapted to this task.

EXAMPLE

Input:

The expert set for the NF-Y transcription factor.

No.  ID of the sequence in EMBL  Sequence of the expert set                           TRANSFAC site number

 1)  MMAGL1                      agttttactgggtagagcaagcacaaaccAGCCAATgagtaactgctccaa  R00511
 2)  HSGAMGLOA *)                ttGACCAATagccttgacaaggcaaacttGACCAATagtcttagagtatcc  R00564
 3)  HSHBA1                      accacccctgcagccccctcccctcacctGACCAATggccacagcctggct  R00572
 4)  HSHSP70D                    cagcctcatcgagctcggtgATTGGCTcagaagggaaaaggcgggtctccg  R00766
 5)  HSGAMGLOA(1) *)             cccatgggttggccagcCTTGCCTtGACCAATagccttgacaaggcaaact  R01858
 6)  HSDNAPOL                    gcagcctcccgagccgctgATTGGCTttcaggctggcgcctgtctcggccc  R03038
 7)  MMTHY11G                    accccctccatccttttccctcagcctccgATTGGCTgaatctagagtccc  R03043
 8)  HSEGL1                      acccctgaggacacaggtcagccttGACCAATgacttttaagtaccatgga  R03119
 9)  HSGP91                      tgttatggatgcaagcttttcagttGACCAATgattattAGCCAATttctg  R03477
10)  SCCYC1G5                    gaagaccaagcgccagctcatttggcgagcGTTGGTTggtggatcaagccc  R00263
11)  AD2                         ttcggcatcaaggaaggtgATTGGTTtataggtgtaggccacgtgaccggg  R00988
12)  MMMHIEDA                    tctagtttaataatttcaggagcagAACCAATcagcagataggaactcggc  R01076
13)  MMMHEKA                     agtctgaaacatttttctgATTGGTTaaaagttgagtgctttggattttaa  R01080
14)  MMMHAA                      gggagttcccctAGCCTCTtccaggcctcctaatacaaagtctgcagctgg  R01942
15)  HE1CG                       catcagcagacAGGCAAGctcaaagtccaggaggtccctggggttgaacag  R01446
16)  MMC1A2                      tggggagagATTGCATctgttctggaggggacagcttgggatgttaaggaa  R00231
17)  HSNPY                       tggggctgtccggactgaccctcgccctgtccctgctcgtgtgcctgggtg  R02028
18)  MMA1COL                     gccgggccaggcagttctgATTGGCTgggggccgggctgctggctccccct  R00228
19)  RNFBG5E                     gtaaagagaccccgtgaccagttccAGCCACTctttagtcccgcccagact  R00445
20)  RNGBA2UA                    gcaccggtgtacattgctcaggatgtAGCCATGtgagaaggcagacttatg  R00577
21)  RNALOBG1                    tgattacaaagATTGGCTgttcacgcgccaatcagagttattgaataaaca  R01799
22)  HSAPOA01                    cctgcagcccccgcagcttgctgtttgcccactctATTTGCCcagccccag  R02959

*) The EMBL sequence HSGAMGLOA contains two different actual binding sites.

Output:

The recognition group of 10 oligonucleotides, each l = 7 bp long, describing the binding site of the NF-Y (CCAAT-binding factor) transcription factor:

Oligonucleotide  Weight

AGCCAAT          (7)
gaCCAAT          (4)
AaCCAAT          (3)
AGCCAcT          (1)
AaCCAAc          (1)
AGCCtcT          (1)
AGgCAAg          (1)
AtgCAAT          (1)
AGCCAtg          (1)
gGCaAAT          (1)

The degree of homology of this RG is T = 2. This means that every oligonucleotide in the group differs from the maximum-weight oligonucleotide AGCCAAT at no more than 2 positions. In the list above, mismatching positions are shown in lowercase.
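The homology condition can be checked directly on the example output. The sketch below assumes that a mismatch is a position-wise difference between two oligonucleotides of equal length (Hamming distance); the helper name `mismatches` is hypothetical, and the weights are copied from the output above.

```python
# Illustrative sketch, not RecGroup code: the names here are hypothetical.
def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length oligonucleotides differ."""
    if len(a) != len(b):
        raise ValueError("oligonucleotides must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


# Recognition group for NF-Y from the example output: oligonucleotide -> weight.
rg = {
    "AGCCAAT": 7, "gaCCAAT": 4, "AaCCAAT": 3, "AGCCAcT": 1, "AaCCAAc": 1,
    "AGCCtcT": 1, "AGgCAAg": 1, "AtgCAAT": 1, "AGCCAtg": 1, "gGCaAAT": 1,
}

T = 2
core = max(rg, key=rg.get)  # the maximum-weight oligonucleotide
for oligo in rg:
    assert mismatches(oligo, core) <= T
print(core, max(mismatches(oligo, core) for oligo in rg))
```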

Algorithm designer: Yury Kondrakhin
