**Description of algorithm**

The input for the analysis is a set of unaligned nucleotide sequences of a particular type of regulatory genomic sequence (RGS). The motifs are degenerate, that is, they are written in an extended 15-letter single-letter code:

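| No. | Symbol | Nucleotides | Meaning |
|-----|--------|-------------|---------|
| 1 | A | A | adenine |
| 2 | T | T | thymine |
| 3 | G | G | guanine |
| 4 | C | C | cytosine |
| 5 | R | G/A | purine |
| 6 | Y | T/C | pyrimidine |
| 7 | M | A/C | amino |
| 8 | K | T/G | keto |
| 9 | W | A/T | weak |
| 10 | S | G/C | strong |
| 11 | B | T/G/C | not A |
| 12 | V | A/G/C | not T |
| 13 | H | A/T/C | not G |
| 14 | D | A/T/G | not C |
| 15 | N | A/T/G/C | any |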
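
As an illustration of how the 15-letter code is used, the following minimal Python sketch maps each symbol to the nucleotides it stands for and checks whether a degenerate motif matches a concrete oligonucleotide. The names `IUPAC`, `motif_matches` and `find_motif` are hypothetical and are not part of the described program.

```python
# 15-letter code: each symbol maps to the set of nucleotides it allows.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "TG", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG",
    "N": "ATGC",
}


def motif_matches(motif: str, oligo: str) -> bool:
    """Return True if every base of `oligo` is allowed by the motif symbol
    at the same position."""
    if len(motif) != len(oligo):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(motif, oligo))


def find_motif(motif: str, sequence: str) -> list[int]:
    """Return the 0-based start positions at which `motif` matches `sequence`."""
    l = len(motif)
    return [
        i
        for i in range(len(sequence) - l + 1)
        if motif_matches(motif, sequence[i:i + l])
    ]
```

For example, the motif `RNTA` matches both `GCTA` and `ACTA`, because R stands for G or A and N for any nucleotide.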

The search for significant motifs is based on compiling the complete vocabulary
of oligonucleotides of length *L* for each RGS and then clustering the
oligonucleotides that belong to different RGS. If the Hamming distance *R* between
oligonucleotides from different sequences is lower than the threshold value *r_{0}*,
they are united into one class. A consensus in the 15-letter single-letter code is
created for each class as follows. The significance of each of the 14 informative
symbols at each position is evaluated by the binomial criterion, and the symbol with
the lowest probability of appearing by chance is selected. The oligonucleotide motif
obtained by this procedure is considered significant if it meets the following conditions:

(1) the fraction

(2) the binomial probability
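
The clustering and consensus steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual implementation: a class is built around a single seed oligonucleotide, the chance probability of a symbol is taken as the number of nucleotides it covers divided by 4 (uniform base frequencies), and N is excluded from the candidate symbols. All function names are hypothetical.

```python
from math import comb

# 15-letter code; N is skipped when building a consensus (assumption).
CODE = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "TG", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length oligos."""
    return sum(x != y for x, y in zip(a, b))


def vocabulary(seq: str, L: int) -> set[str]:
    """All distinct oligonucleotides of length L found in `seq`."""
    return {seq[i:i + L] for i in range(len(seq) - L + 1)}


def class_for_seed(seed: str, sequences: list[str], r0: int) -> list[str]:
    """Collect oligos from each sequence whose Hamming distance to `seed`
    is lower than the threshold r0 (seed-based grouping is an assumption)."""
    members = []
    for seq in sequences:
        members.extend(
            o for o in sorted(vocabulary(seq, len(seed))) if hamming(seed, o) < r0
        )
    return members


def binom_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n trials with success chance p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def consensus(oligos: list[str]) -> str:
    """Per position, pick the symbol least likely to be matched this often
    by chance, assuming uniform base frequencies."""
    n = len(oligos)
    symbols = [s for s in CODE if s != "N"]
    result = []
    for column in zip(*oligos):
        best = min(
            symbols,
            key=lambda s: binom_tail(
                sum(b in CODE[s] for b in column), n, len(CODE[s]) / 4
            ),
        )
        result.append(best)
    return "".join(result)
```

With the class `["ACGT", "ACGA", "ACGT", "ACGA"]`, the first three positions are fully conserved and the last one alternates between T and A, so under these assumptions the consensus is `ACGW`.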

**Assessing significance for the oligonucleotide motif**

The aim of the algorithm is to find oligonucleotide motifs that are significantly overrepresented in a set of RGS (regulatory genomic sequences) and may therefore play a specific biological role.

Let us consider a sequence *S _{i}* of length

Let us consider an oligonucleotide motif

If the expected number of occurrences of a particular oligonucleotide in a sequence,
calculated as *(L-l+1)·P(M)*, is less than 1, as it is in our case, the probability
that the motif in question occurs at least once in the sequence *S_{i}* can be
approximated by the Poisson distribution:

Consider the set of sequences *S = S_{1}, …, S_{N}*. The
binomial probability

.
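
The two probabilities discussed in this section can be illustrated with a short Python sketch. It assumes that the Poisson approximation takes the usual form 1 − exp(−λ) with λ = (L−l+1)·P(M), and that the binomial probability is the upper tail, that is, the chance of the motif occurring in at least *n* of the *N* sequences when each sequence contains it independently with the same probability. Both the function names and these simplifications are assumptions made for illustration.

```python
from math import comb, exp


def p_at_least_once(seq_len: int, motif_len: int, p_motif: float) -> float:
    """Poisson approximation of the probability that a motif with
    per-position probability `p_motif` occurs at least once in a sequence."""
    lam = (seq_len - motif_len + 1) * p_motif  # expected number of occurrences
    return 1.0 - exp(-lam)


def binomial_tail(n_hits: int, n_seqs: int, p: float) -> float:
    """Probability that at least `n_hits` of `n_seqs` sequences contain the
    motif, if each contains it independently with probability `p`."""
    return sum(
        comb(n_seqs, j) * p**j * (1 - p) ** (n_seqs - j)
        for j in range(n_hits, n_seqs + 1)
    )


# Usage: an 8-letter non-degenerate motif under uniform base frequencies,
# found in 6 of 20 sequences of length 100 (all numbers are made up).
p_seq = p_at_least_once(100, 8, 0.25 ** 8)
significance = binomial_tail(6, 20, p_seq)
```

A small value of `significance` would indicate that the motif is found in more sequences than expected by chance.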