Description of the algorithm

The input to the analysis is a set of unaligned nucleotide sequences that all belong to one type of regulatory genomic sequence (RGS). The motifs are degenerate, that is, they are written in an expanded 15-letter single-letter code.

No.   Symbol   Nucleotides   Name
1     A        A             adenine
2     T        T             thymine
3     G        G             guanine
4     C        C             cytosine
5     R        G/A           purine
6     Y        T/C           pyrimidine
7     M        A/C           amino
8     K        T/G           keto
9     W        A/T           weak
10    S        G/C           strong
11    B        C/G/T         not A
12    V        A/C/G         not T
13    H        A/C/T         not G
14    D        A/G/T         not C
15    N        A/C/G/T       any
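
As an illustration of how the 15-letter code can be used, the following minimal sketch stores each symbol as the set of nucleotides it covers and tests whether a degenerate motif matches an oligonucleotide. The names CODE, matches and contains_motif are hypothetical and are not taken from the original implementation.

```python
# Each symbol of the expanded 15-letter code maps to the nucleotides it covers.
# CODE, matches and contains_motif are illustrative names (assumptions).
CODE = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "TG", "W": "AT", "S": "GC",
    "B": "CGT", "V": "ACG", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def matches(motif: str, oligo: str) -> bool:
    """True if each nucleotide of `oligo` is allowed by the motif letter at that position."""
    return len(motif) == len(oligo) and all(
        base in CODE[letter] for letter, base in zip(motif, oligo)
    )


def contains_motif(motif: str, sequence: str) -> bool:
    """True if the motif matches at least one window of the sequence."""
    l = len(motif)
    return any(
        matches(motif, sequence[i:i + l]) for i in range(len(sequence) - l + 1)
    )
```

For example, the motif TATAWA matches the window TATAAA because W covers both A and T.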

The search for significant motifs considers the complete vocabulary of oligonucleotides of length l in each RGS and then clusters the oligonucleotides that belong to different RGS. If the Hamming distance R between oligonucleotides from different sequences is lower than the threshold value r0, they are united into one class. A consensus in the 15-letter code is then created for each class as follows: the significance of each of the 14 informative letters (every letter except N) is evaluated at each position by the binomial criterion, and the letter with the minimal probability of appearing by chance is selected. The oligonucleotide motif obtained by this procedure is considered significant if it meets the following conditions:
(1) the fraction f of the RGS containing the motif is higher than a given level f0, and
(2) the binomial probability P(n,N) of observing this motif by chance in n or more of the N RGS considered is lower than a given significance level α.
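
The first stage of the procedure (building the vocabulary of each sequence and comparing oligonucleotides from different sequences by Hamming distance) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how pairs are merged into classes, so the sketch stops at listing the pairs whose distance is below the threshold. The function names and the parameter r0 are hypothetical.

```python
from itertools import combinations


def vocabulary(seq: str, l: int) -> set:
    """All distinct oligonucleotides of length l found in the sequence."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("oligonucleotides must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(sequences, l: int, r0: int):
    """Pairs of oligonucleotides from different sequences with Hamming distance below r0.

    Returns tuples (i, oligo_i, j, oligo_j), where i < j index the sequences.
    """
    vocabs = [vocabulary(s, l) for s in sequences]
    pairs = []
    for (i, vi), (j, vj) in combinations(enumerate(vocabs), 2):
        for a in sorted(vi):
            for b in sorted(vj):
                if hamming(a, b) < r0:  # "lower than the threshold"
                    pairs.append((i, a, j, b))
    return pairs
```

Comparing every pair of vocabularies is quadratic in the number of oligonucleotides; the sketch favors clarity over speed.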

Assessing the significance of an oligonucleotide motif

The aim of the algorithm is to find oligonucleotide motifs that are significantly over-represented in a set of RGS and may therefore play a specific biological role.

Let us consider a sequence Si of length L with the nucleotide frequencies Pa, Pt, Pg and Pc for A, T, G and C, respectively. The frequency of a letter of the expanded 15-letter code, K, is the sum of the frequencies of the nucleotides covered by this letter (for example, Ps = Pg + Pc).
Let us consider an oligonucleotide motif M = m1, m2, ..., ml (mi ∈ K) of length l. The probability that this motif occurs in a region of length l of the sequence is the product of the frequencies of its letters:

$$P(M) = \prod_{i=1}^{l} P_{m_i}$$
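
A minimal sketch of this calculation, assuming that positions are independent and that the nucleotide frequencies are estimated from the sequence itself. The helper names are hypothetical.

```python
from collections import Counter
from math import prod

# Nucleotides covered by each letter of the 15-letter code.
CODE = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "TG", "W": "AT", "S": "GC",
    "B": "CGT", "V": "ACG", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def nucleotide_frequencies(seq: str) -> dict:
    """Relative frequencies Pa, Pt, Pg, Pc of a non-empty sequence over A, T, G, C."""
    counts = Counter(seq)
    total = sum(counts[b] for b in "ATGC")
    return {b: counts[b] / total for b in "ATGC"}


def letter_frequency(letter: str, freqs: dict) -> float:
    """Frequency of a code letter: sum over the nucleotides it covers."""
    return sum(freqs[b] for b in CODE[letter])


def motif_probability(motif: str, freqs: dict) -> float:
    """P(M): product of the letter frequencies over all motif positions."""
    return prod(letter_frequency(m, freqs) for m in motif)
```

With equal nucleotide frequencies of 0.25, the motif SN has probability 0.5 * 1.0 = 0.5.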

If the expected number of occurrences of a particular oligonucleotide in a sequence, calculated as (L-l+1)*P(M), is less than 1, as in our case, the probability that the motif in question occurs at least once in the sequence Si can be approximated using the Poisson distribution:

$$P(M, S_i) = 1 - e^{-(L-l+1)\,P(M)}$$
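
The Poisson approximation can be sketched as below. The function and parameter names are hypothetical; expm1 is used only to keep precision when the expected count is very small.

```python
from math import expm1


def prob_at_least_once(p_motif: float, seq_len: int, motif_len: int) -> float:
    """Poisson approximation of the chance that a motif occurs at least once.

    p_motif   -- probability P(M) of the motif at a single position
    seq_len   -- sequence length L
    motif_len -- motif length l
    """
    expected = (seq_len - motif_len + 1) * p_motif  # expected number of occurrences
    return -expm1(-expected)  # equals 1 - exp(-expected)
```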

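Consider the set of sequences S = S1, …, SN. The binomial probability P(n,N) of observing the motif M in n or more (0 <= n <= N) sequences is:

$$P(n,N) = \sum_{k=n}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k}$$

where p is the probability that the motif occurs at least once in a single sequence.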
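
A hedged sketch of the final test, assuming that P(n,N) is the upper tail of the binomial distribution (the motif is seen in n or more of N sequences) and that a single per-sequence probability p applies to every sequence. The names binomial_tail, is_significant, f0 and alpha are hypothetical.

```python
from math import comb


def binomial_tail(n: int, N: int, p: float) -> float:
    """Probability of at least n successes in N independent trials with success chance p."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(n, N + 1))


def is_significant(n: int, N: int, p: float, f0: float, alpha: float) -> bool:
    """Apply both conditions: fraction n/N above f0 and binomial tail below alpha."""
    return n / N > f0 and binomial_tail(n, N, p) < alpha
```

For instance, a motif found in 8 of 10 sequences with p = 0.1 passes both conditions for f0 = 0.5 and alpha = 0.01, while a motif found in only 1 of 10 sequences fails condition (1).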