Oligonucleotide frequency matrix

To construct an oligonucleotide frequency matrix, the site set {S₁, ..., S_n, ..., S_N} was used.

It contains N experimentally determined nucleotide sequences S_n=s_1n...s_in...s_Ln, each L bp in length (where s ∈ {A, T, G, C}). All these sequences are multiply aligned by the standard Gibbs-potential method (Lawrence, 1994).

The oligonucleotide alphabet {E₁, ..., E_j, ..., E_k} of k pseudoletters E_j={e_1j e_2j...e_mj}, each m bp in length, is fixed (where e ∈ {A, T, G, C, W=(A, T), S=(G, C), R=(A, G), Y=(T, C), M=(A, C), K=(T, G)}). With these definitions, the oligonucleotide frequency matrix F_(L-m+1),k={f_ij}, with L-m+1 rows and k columns, is calculated as follows:

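f_ij = (1/N) · Σ_{n=1..N} δ(s_in s_(i+1)n ... s_(i+m-1)n ∈ E_j),    (1)

where δ(true)=1, δ(false)=0, and the condition holds when each nucleotide of the m bp fragment starting at position i of the sequence S_n belongs to the corresponding letter of the pseudoletter E_j.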

Formula (1) estimates the frequency f_ij with which the pseudoletter E_j occupies the i-th position within the multiply aligned site sequences, provided that the total number of these sequences, N, is much larger than the total number of pseudoletters, k, in the oligonucleotide alphabet used. For this reason, in this work, formula (1) was applied to alphabets #1, #6, #7, #9, #13, #14, #17, #20, #21, #25 when N>8; to alphabets #10, #15, #22 when N>25; to alphabets #2, #4, #16, #26 when N>65; and, finally, to alphabet #5 when N>200.
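
The calculation described above can be illustrated with a minimal Python sketch. It assumes that formula (1) is the fraction of the N aligned sites whose m bp fragment starting at position i matches the pseudoletter E_j. The names CODES, matches and frequency_matrix, the four example sites and the dinucleotide alphabet {WW, WS, SW, SS} are hypothetical and chosen only for illustration; they are not one of the numbered alphabets used in this work.

```python
from typing import List, Sequence

# Degenerate letters as defined in the text: each code maps to the
# nucleotides it stands for.
CODES = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "S": "GC", "R": "AG", "Y": "TC", "M": "AC", "K": "TG",
}


def matches(oligo: str, pseudoletter: str) -> bool:
    """Return True if every nucleotide of `oligo` belongs to the
    corresponding degenerate letter of `pseudoletter`."""
    if len(oligo) != len(pseudoletter):
        return False
    return all(s in CODES[e] for s, e in zip(oligo, pseudoletter))


def frequency_matrix(sites: Sequence[str],
                     alphabet: Sequence[str]) -> List[List[float]]:
    """Assumed form of formula (1): f[i][j] is the fraction of sites
    whose m-mer starting at position i matches pseudoletter j."""
    if not sites or not alphabet:
        raise ValueError("sites and alphabet must be non-empty")
    length = len(sites[0])
    m = len(alphabet[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length L")
    if any(len(e) != m for e in alphabet):
        raise ValueError("all pseudoletters must have the same length m")
    if m > length:
        raise ValueError("pseudoletter length m must not exceed L")

    n_sites = len(sites)
    rows = length - m + 1  # the matrix has L-m+1 rows and k columns
    matrix = []
    for i in range(rows):
        row = []
        for pseudoletter in alphabet:
            count = sum(matches(s[i:i + m], pseudoletter) for s in sites)
            row.append(count / n_sites)
        matrix.append(row)
    return matrix


# Hypothetical toy data: N=4 aligned sites of L=4 bp, k=4 pseudoletters of m=2 bp.
example_sites = ["ATGC", "ACGG", "TTGC", "AAGC"]
example_alphabet = ["WW", "WS", "SW", "SS"]
for row in frequency_matrix(example_sites, example_alphabet):
    print(row)
# prints:
# [0.75, 0.25, 0.0, 0.0]
# [0.0, 0.75, 0.0, 0.25]
# [0.0, 0.0, 0.0, 1.0]
```

With only N=4 sites and k=4 pseudoletters, the toy example does not satisfy the condition that N be much larger than k, so its frequencies would not be reliable estimates; it only shows the bookkeeping.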