
Oligonucleotide alphabets used in the MATRIX database

The oligonucleotide alphabets used are listed in the Table below. The canonical nucleotide alphabet {A, T, G, C} is alphabet #1. Berg and von Hippel (1987) demonstrated that frequency matrices built on the canonical alphabet for protein binding sites in DNA reflect the evolutionary optimization of these sites toward maximal DNA/protein affinity. That is why the canonical frequency matrix is commonly accepted and widely used to recognize protein binding sites in DNA.
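As an illustration of what a canonical frequency matrix is, the following minimal Python sketch counts how often each letter of alphabet #1 occurs at each position of a set of aligned sites. The function name `frequency_matrix` and the example sites are hypothetical and invented for illustration; the MATRIX database may normalize or store its matrices differently.

```python
from collections import Counter

ALPHABET = "ATGC"  # canonical alphabet #1, in the letter order used in this document


def frequency_matrix(sites):
    """Return {letter: [relative frequency at position 0, 1, ...]}.

    `sites` is a list of aligned, equal-length site sequences.
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    matrix = {letter: [0.0] * length for letter in ALPHABET}
    for pos in range(length):
        # Count the letters observed in this column of the alignment.
        counts = Counter(site[pos] for site in sites)
        for letter in ALPHABET:
            matrix[letter][pos] = counts[letter] / len(sites)
    return matrix


# Hypothetical aligned sites, invented for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
fm = frequency_matrix(sites)
```

Each column of the resulting matrix sums to 1, so a strongly conserved position shows up as a single letter with a frequency close to 1.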

Kondrakhin et al. (1995) generalized the canonical alphabet to the trinucleotide alphabet {AAA, AAT, …, CCG, CCC} (alphabet #3). They demonstrated that the frequency matrix of alphabet #3 is informative for describing the site-specific combinations of nucleotides that tend to occupy adjacent positions within the 3' cleavage sites of pre-mRNA and is thus especially useful for recognizing these sites (Kondrakhin et al., 1995). We formally introduce the dinucleotide alphabet {AA, AT, …, CG, CC} (alphabet #2) because it falls naturally between alphabets #1 and #3. Since a number of well-known functional site consensuses contain the symbol “any nucleotide”, x = {A, T, G, C}, we have also inserted this symbol between the canonical nucleotides of alphabets #2 and #3. This yields two novel alphabets, {AxA, AxT, …, CxG, CxC} and {AxAxA, AxAxT, …, CxCxG, CxCxC}, of trinucleotides and pentanucleotides with holes (Table, alphabets #4 and #5).
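The construction of alphabets #2 to #5 can be sketched in a few lines of Python. The helper `make_alphabet` is a hypothetical name introduced here for illustration; it only assumes that pseudoletters are enumerated in the A, T, G, C order used in this document and that a hole is written as a lowercase `x`.

```python
from itertools import product


def make_alphabet(letters, n_letters, gapped=False):
    """Enumerate all pseudoletters built from `n_letters` informative letters.

    With gapped=True, the "any nucleotide" symbol x is placed between
    neighbouring informative letters, e.g. A and T give "AxT".
    """
    separator = "x" if gapped else ""
    return [separator.join(combo) for combo in product(letters, repeat=n_letters)]


canonical = "ATGC"
n2 = make_alphabet(canonical, 2)                 # alphabet #2: AA, AT, ..., CC
n3 = make_alphabet(canonical, 3)                 # alphabet #3: AAA, AAT, ..., CCC
n3x = make_alphabet(canonical, 2, gapped=True)   # alphabet #4: AxA, AxT, ..., CxC
n5x = make_alphabet(canonical, 3, gapped=True)   # alphabet #5: AxAxA, ..., CxCxC
```

Inserting holes changes the length of each pseudoletter (3 bp and 5 bp here) but not their number, because x carries no choice: the gapped alphabets have the same 16 and 64 pseudoletters as alphabets #2 and #3.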

The dichotomy alphabets {W=(A, T), S=(G, C)}, {R=(A, G), Y=(T, C)} and {M=(A, C), K=(T, G)} are also commonly used to interpret site structure in terms of the thermodynamic, conformational, and electrostatic features of the sites, respectively. These dichotomy alphabets are used in the MATRIX database too (Table, alphabets #6, #13 and #20). Just as alphabets #2-#5 were derived from the canonical alphabet #1, the remaining oligonucleotide alphabets #7-#12, #14-#19 and #21-#26 are derived from the dichotomy alphabets #6, #13 and #20, respectively.
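To make the dichotomy alphabets concrete, the sketch below recodes a canonical sequence into each two-letter alphabet. The dictionary `DICHOTOMIES` and the function `recode` are hypothetical names used only for this illustration; the letter assignments are those given in the text.

```python
# Letter assignments as defined in the text: W=(A, T), S=(G, C),
# R=(A, G), Y=(T, C), M=(A, C), K=(T, G).
DICHOTOMIES = {
    "WS": {"A": "W", "T": "W", "G": "S", "C": "S"},
    "RY": {"A": "R", "G": "R", "T": "Y", "C": "Y"},
    "MK": {"A": "M", "C": "M", "T": "K", "G": "K"},
}


def recode(sequence, scheme):
    """Rewrite a sequence over {A, T, G, C} in the chosen dichotomy alphabet."""
    table = str.maketrans(DICHOTOMIES[scheme])
    return sequence.upper().translate(table)


example = "GATTACA"  # arbitrary example sequence
ws = recode(example, "WS")  # "SWWWWSW"
ry = recode(example, "RY")  # "RRYYRYR"
mk = recode(example, "MK")  # "KMKKMMM"
```

The same sequence therefore yields three different two-letter descriptions, one per dichotomy.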

Thus, the total number of oligonucleotide alphabets used is 26.
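The count of 26 can be re-derived from the construction rule: the canonical family contributes contiguous alphabets of 1 to 3 letters plus gapped alphabets of 2 and 3 informative letters (5 in total), and each of the three dichotomy families contributes contiguous alphabets of 1 to 4 letters plus gapped alphabets of 2 to 4 informative letters (7 each). The sketch below is an assumption-laden illustration of that rule. The names `alphabet_shapes` and `build_catalogue` are hypothetical, and the generated names simply follow the pattern prefix + length m + optional `x`.

```python
def alphabet_shapes(max_letters):
    """(informative letters n, gapped?) pairs: contiguous 1..max, then gapped 2..max."""
    contiguous = [(n, False) for n in range(1, max_letters + 1)]
    gapped = [(n, True) for n in range(2, max_letters + 1)]
    return contiguous + gapped


def build_catalogue():
    """Return rows (no., name, m, k) for all alphabets implied by the rule."""
    # (name prefix, number of base letters, maximum informative letters)
    families = [("N", 4, 3), ("WS", 2, 4), ("RY", 2, 4), ("MK", 2, 4)]
    rows = []
    for prefix, base, max_letters in families:
        for n, gapped in alphabet_shapes(max_letters):
            m = 2 * n - 1 if gapped else n   # pseudoletter length in bp
            k = base ** n                    # number of pseudoletters
            name = f"{prefix}{m}{'x' if gapped else ''}"
            rows.append((len(rows) + 1, name, m, k))
    return rows


catalogue = build_catalogue()
total = len(catalogue)  # 5 + 3 * 7 = 26
```

In this scheme the length m grows with the holes, while the number of pseudoletters k depends only on the number of informative letters: k is 4 to the power n for the canonical family and 2 to the power n for the dichotomy families.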

 

| No. | Name | m | Alphabet {E1, ..., Ej, ..., Ek-1, Ek} of k pseudoletters Ej, each m bp in length | k |
|---|---|---|---|---|
| | | | The DNA/protein affinity optimized (Berg and von Hippel, 1987, J. Mol. Biol., 193, 723-750) | |
| 1 | N1 | 1 | A, T, G, C | 4 |
| | | | Site-specific nucleotide preferences to be in adjacent positions (Kondrakhin et al., 1995, Comput. Appl. Biosci., 11, 477-488) | |
| 2 | N2 | 2 | AA, AT, AG, AC, TA, TT, TG, TC, GA, ..., GC, CA, CT, CG, CC | 16 |
| 3 | N3 | 3 | AAA, AAT, AAG, AAC, ATA, ..., CGC, CCA, CCT, CCG, CCC | 64 |
| 4 | N3x | 3 | AxA, AxT, AxG, AxC, TxA, TxT, ..., GxC, CxA, CxT, CxG, CxC | 16 |
| 5 | N5x | 5 | AxAxA, AxAxT, AxAxG, AxAxC, ..., CxCxA, CxCxT, CxCxG, CxCxC | 64 |
| | | | The thermodynamic properties of the functional DNA site (Ponomarenko et al., 1999, Bioinformatics, 15, 7/8, 631-643) | |
| 6 | WS1 | 1 | W, S | 2 |
| 7 | WS2 | 2 | WW, WS, SW, SS | 4 |
| 8 | WS3 | 3 | WWW, WWS, WSW, WSS, SWW, SWS, SSW, SSS | 8 |
| 9 | WS4 | 4 | WWWW, WWWS, WWSW, WWSS, ..., SSWS, SSSW, SSSS | 16 |
| 10 | WS3x | 3 | WxW, WxS, SxW, SxS | 4 |
| 11 | WS5x | 5 | WxWxW, WxWxS, WxSxW, WxSxS, ..., SxWxS, SxSxW, SxSxS | 8 |
| 12 | WS7x | 7 | WxWxWxW, WxWxWxS, WxWxSxW, ..., SxSxWxS, SxSxSxW, SxSxSxS | 16 |
| | | | The conformation properties of the functional DNA site (Ponomarenko et al., 1999, Bioinformatics, 15, 7/8, 631-643) | |
| 13 | RY1 | 1 | R, Y | 2 |
| 14 | RY2 | 2 | RR, RY, YR, YY | 4 |
| 15 | RY3 | 3 | RRR, RRY, RYR, RYY, YRR, YRY, YYR, YYY | 8 |
| 16 | RY4 | 4 | RRRR, RRRY, RRYR, RRYY, ..., YYRR, YYRY, YYYR, YYYY | 16 |
| 17 | RY3x | 3 | RxR, RxY, YxR, YxY | 4 |
| 18 | RY5x | 5 | RxRxR, RxRxY, RxYxR, RxYxY, ..., YxRxY, YxYxR, YxYxY | 8 |
| 19 | RY7x | 7 | RxRxRxR, RxRxRxY, RxRxYxR, ..., YxYxRxY, YxYxYxR, YxYxYxY | 16 |
| | | | The electrostatic properties of the functional DNA site (Ponomarenko et al., 1999, Bioinformatics, 15, 7/8, 631-643) | |
| 20 | MK1 | 1 | M, K | 2 |
| 21 | MK2 | 2 | MM, MK, KM, KK | 4 |
| 22 | MK3 | 3 | MMM, MMK, MKM, MKK, KMM, KMK, KKM, KKK | 8 |
| 23 | MK4 | 4 | MMMM, MMMK, MMKM, MMKK, ..., KKMK, KKKM, KKKK | 16 |
| 24 | MK3x | 3 | MxM, MxK, KxM, KxK | 4 |
| 25 | MK5x | 5 | MxMxM, MxMxK, MxKxM, MxKxK, ..., KxKxM, KxKxK | 8 |
| 26 | MK7x | 7 | MxMxMxM, MxMxMxK, MxMxKxM, ..., KxKxKxM, KxKxKxK | 16 |

Note: M=(A, C), K=(G, T), R=(A, G), Y=(T, C), W=(A, T), S=(G, C), x=(A, T, G, C).
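One way to read a sequence in any of these alphabets is to slide a window of m bp along it and keep only the informative positions, skipping every position marked x. The sketch below is a minimal illustration under that assumption and is not a description of how the MATRIX database scans sequences. The function names are hypothetical, and the input may be a canonical sequence or one already recoded into W/S, R/Y or M/K letters.

```python
from collections import Counter


def window_span(n_letters, gapped):
    """Length m in bp of a pseudoletter with `n_letters` informative letters."""
    return 2 * n_letters - 1 if gapped else n_letters


def pseudoletter_at(sequence, start, n_letters, gapped=False):
    """Return the pseudoletter that begins at position `start` (0-based)."""
    span = window_span(n_letters, gapped)
    if start < 0 or start + span > len(sequence):
        raise IndexError("window does not fit inside the sequence")
    step = 2 if gapped else 1
    # With holes, only every second position is informative; the rest match x.
    informative = sequence[start:start + span:step]
    return ("x" if gapped else "").join(informative)


def pseudoletter_counts(sequence, n_letters, gapped=False):
    """Count pseudoletters over all window positions of the sequence."""
    span = window_span(n_letters, gapped)
    return Counter(
        pseudoletter_at(sequence, i, n_letters, gapped)
        for i in range(len(sequence) - span + 1)
    )


site = "TATAAT"  # arbitrary example sequence
dinucleotides = pseudoletter_counts(site, 2)            # contiguous, m = 2
gapped_pairs = pseudoletter_counts(site, 2, gapped=True)  # with a hole, m = 3
```

For the example sequence, the contiguous reading gives TA and AT twice each and AA once, while the gapped reading gives TxT, AxA, TxA and AxT once each.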