Description of algorithm
The input to the analysis is a set of unaligned nucleotide sequences of regulatory genomic sequences (RGS) of a given type. The motifs are degenerate, that is, they are written in an extended 15-letter single-letter code.
No.  Symbol  Nucleotides      Name
1    A       A                adenine
2    T       T                thymine
3    G       G                guanine
4    C       C                cytosine
5    R       G/A              purine
6    Y       T/C              pyrimidine
7    M       A/C              amino
8    K       T/G              keto
9    W       A/T              weak
10   S       G/C              strong
11   B       T/G/C            not A
12   V       A/G/C            not T
13   H       A/T/C            not G
14   D       A/T/G            not C
15   N       A/T/G/C          any
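The code table above can be written down as a lookup structure. The following is a minimal sketch, not the authors' implementation; the names `IUPAC` and `matches` are hypothetical and introduced only for illustration.

```python
# Each symbol of the 15-letter code maps to the nucleotides it covers.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "V": "ACG", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def matches(motif: str, oligo: str) -> bool:
    """Return True if each nucleotide of `oligo` is covered by the
    degenerate letter at the same position of `motif`."""
    if len(motif) != len(oligo):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(motif, oligo))
```

For example, `matches("RYN", "ACG")` holds because R covers A, Y covers C, and N covers any nucleotide.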
The search for significant motifs considers the complete vocabulary of
oligonucleotides of length l in each RGS and then clusters the oligonucleotides
that belong to different RGS. If the Hamming distance R between
oligonucleotides from different sequences is lower than the threshold value r0,
they are placed in one class. For each class, a consensus in the 15-letter code is
built as follows. The significance of each of the 14 informative letters (all except N)
at each position is evaluated by the binomial criterion, and the letter with the lowest
probability of appearing by chance is selected. The oligonucleotide motif obtained by this
procedure is considered significant if it meets the following conditions:
(1) the fraction f of the RGS containing the motif is higher than a given
level f0, and
(2) the binomial probability P(n,N) of observing this motif by chance in n
or more of the N RGS considered is lower than a given significance level α.
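The first step of the procedure, collecting the vocabulary of each sequence and comparing oligonucleotides from different sequences by Hamming distance, can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how classes are assembled from similar pairs, so the sketch stops at listing the pairs, and the function names are hypothetical.

```python
from itertools import combinations


def vocabulary(seq: str, length: int) -> set:
    """All distinct oligonucleotides of the given length in `seq`."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(sequences, length, r0):
    """List (i, oligo_i, j, oligo_j) for oligonucleotides taken from
    different sequences i < j whose Hamming distance is lower than r0."""
    vocabs = [vocabulary(s, length) for s in sequences]
    pairs = []
    for (i, vi), (j, vj) in combinations(enumerate(vocabs), 2):
        for a in sorted(vi):
            for b in sorted(vj):
                if hamming(a, b) < r0:
                    pairs.append((i, a, j, b))
    return pairs
```

The exhaustive pairwise comparison is quadratic in vocabulary size; it is chosen here for clarity, not speed.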
Assessing significance for the oligonucleotide motif
The aim of the algorithm is to find oligonucleotide motifs that are significantly over-represented in a set of RGS and may therefore play a specific biological role.
Consider a sequence Si of length L with the nucleotide
frequencies Pa, Pt, Pg and Pc.
The frequency of a letter of the extended 15-letter code, K, is
the sum of the frequencies of the nucleotides covered by this letter (for
example, Ps = Pg + Pc).
Consider an oligonucleotide motif M = m1,m2,...,ml
(mi ∈ K) of length l. Treating the positions as independent, the probability that this motif
occurs in a region of length l of the sequence is:
P(M) = Pm1 * Pm2 * ... * Pml
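The letter frequencies and the motif probability can be computed as in the sketch below. It assumes that positions are independent, so that P(M) is the product of the per-position letter frequencies; the function names are hypothetical.

```python
from collections import Counter
from math import prod

# Nucleotides covered by each letter of the 15-letter code.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "V": "ACG", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def nucleotide_freqs(seq: str) -> dict:
    """Observed frequencies Pa, Pc, Pg, Pt of a sequence."""
    counts = Counter(seq)
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def letter_freq(sym: str, freqs: dict) -> float:
    """Frequency of a degenerate letter, e.g. Ps = Pg + Pc."""
    return sum(freqs[b] for b in IUPAC[sym])


def motif_probability(motif: str, freqs: dict) -> float:
    """P(M): product of the letter frequencies over the motif positions."""
    return prod(letter_freq(sym, freqs) for sym in motif)
```

With uniform frequencies of 0.25 each, the motif ASN has probability 0.25 * 0.5 * 1 = 0.125.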
If the expected number of occurrences of a particular oligonucleotide in a sequence, (L-l+1)*P(M), is less than 1, as in our case, the probability that the motif in question occurs at least once in the sequence Si can be approximated by the Poisson distribution:
P(M,Si) = 1 - exp(-(L-l+1)*P(M))
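The Poisson approximation amounts to one minus the probability of zero occurrences when the expected count is (L-l+1)*P(M). A minimal sketch, with a hypothetical function name:

```python
from math import expm1


def prob_at_least_once(p_motif: float, seq_len: int, motif_len: int) -> float:
    """Poisson approximation of the probability that a motif with
    per-window probability `p_motif` occurs at least once in a sequence."""
    expected = (seq_len - motif_len + 1) * p_motif  # expected number of occurrences
    # 1 - exp(-expected), written with expm1 for accuracy at small values
    return -expm1(-expected)
```

Using `expm1` avoids the loss of precision that `1 - exp(-x)` suffers when the expected count is very small, which is the regime the text describes.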
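Consider the set of sequences S = S1,...,SN. The binomial probability P(n,N) of observing the motif M in n (0<=n<=N) sequences is:
P(n,N) = C(N,n) * p^n * (1-p)^(N-n),
where p is the probability that the motif occurs at least once in a single sequence and C(N,n) is the binomial coefficient.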
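The binomial test and the two significance conditions can be sketched as follows. The sketch assumes a single probability p, common to all sequences, that the motif occurs at least once in a sequence; how the per-sequence probabilities are combined into one value is not stated in the text. The names `binomial_pmf`, `binomial_tail` and `is_significant` are hypothetical.

```python
from math import comb


def binomial_pmf(n: int, N: int, p: float) -> float:
    """Probability of observing the motif in exactly n of N sequences."""
    return comb(N, n) * p**n * (1 - p) ** (N - n)


def binomial_tail(n: int, N: int, p: float) -> float:
    """Probability of observing the motif in n or more of N sequences."""
    return sum(binomial_pmf(k, N, p) for k in range(n, N + 1))


def is_significant(n: int, N: int, p: float, f0: float, alpha: float) -> bool:
    """Check both conditions: fraction n/N above f0 and tail probability below alpha."""
    return n / N > f0 and binomial_tail(n, N, p) < alpha
```

For instance, a motif found in 9 of 10 sequences with p = 0.1 passes both conditions for f0 = 0.5 and alpha = 0.01, since the chance probability of such an observation is far below the significance level.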