Significant features of the DNA/RNA site sequences

Because the experimental data on functional DNA site activity stored in the ACTIVITY database are strongly constrained by the specific experimental conditions under which they were obtained, both the training and the control data sets should be constructed by dividing the same data set.
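The division into training and control sets could be sketched as follows. This is only an illustration: the text does not say how the data set is divided, so the random shuffle, the 50/50 proportion, the function name split_dataset and the toy sequences and activity values are all assumptions.

```python
import random

def split_dataset(records, train_fraction=0.5, seed=0):
    """Shuffle (sequence, activity) pairs and divide them into two parts.

    The random split and the default fraction are assumptions; the source
    only states that both sets come from dividing one data set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data, invented for illustration only.
records = [("ACGTAC", 1.2), ("TTGACA", 0.4), ("TATAAT", 2.1), ("GGCGCC", 0.3)]
training, control = split_dataset(records)
```

Every record ends up in exactly one of the two sets, so the control set stays independent of the data used for fitting.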

The training data set is analysed by the following algorithm.

(1) For each DNA sequence S_n with activity value F_n in the training data set {S_n, F_n}, all possible sequence-dependent features {X_k(S_n)} are calculated by exhaustive enumeration of the features.

(2) For each fixed sequence-dependent DNA feature X_k, the coefficient pair (f_0k, f_1k) of the simple regression F_k(S) = f_0k + f_1k × X_k(S) is optimized.

(3) For the feature X_k, the predicted and experimental activities {F_k(S_n), F_n} are compared by calculating a quantitative score U(X_k, F), called the utility.
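A minimal sketch of step (2) for a single feature is given here. It assumes that the coefficient pair is optimized by ordinary least squares, which the text does not state explicitly. The feature dinucleotide_count is a hypothetical stand-in for one X_k, and the sequences and activities are invented toy values.

```python
def dinucleotide_count(seq, dinucleotide="TA"):
    """Hypothetical feature X_k: overlapping occurrences of one dinucleotide."""
    return sum(seq[i:i + 2] == dinucleotide for i in range(len(seq) - 1))

def fit_simple_regression(x, f):
    """Least-squares estimates (f0, f1) for the model F = f0 + f1 * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_f = sum(f) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # A constant feature carries no information: predict the mean activity.
        return mean_f, 0.0
    sxf = sum((xi - mean_x) * (fi - mean_f) for xi, fi in zip(x, f))
    f1 = sxf / sxx
    return mean_f - f1 * mean_x, f1

# Toy training set {S_n, F_n}, invented for illustration only.
training = [("TATAAT", 2.0), ("TTGACA", 0.0), ("ACGTAC", 1.0), ("GGCGCC", 0.0)]
x = [dinucleotide_count(s) for s, _ in training]
f = [activity for _, activity in training]
f0, f1 = fit_simple_regression(x, f)
predicted = [f0 + f1 * xi for xi in x]  # F_k(S_n) for every training sequence
```

The list predicted holds the values F_k(S_n) that step (3) compares with the experimental activities F_n.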

If no feature has U(X_k, F) > 0, nothing is selected at this step; this is the negative result of the algorithm. Otherwise, for every two features X_k and X^#_m that correlate linearly with each other and have utilities 0 < U(X_k, F) < U(X^#_m, F), the feature X_k with the lower utility is discarded. This yields linearly independent DNA features {X^#_m} with the highest positive utilities {U(X^#_m, F) > 0} for predicting the activity F; this is the positive result of the algorithm.
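The selection rule can be illustrated with the sketch below. Several things in it are assumptions: the utilities are passed in as precomputed numbers because the formula for U is not given here, "linearly correlating" is interpreted as an absolute Pearson correlation at or above a hypothetical threshold r_threshold, and the feature names and values are invented.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / sqrt(var_a * var_b)

def select_features(values, utility, r_threshold=0.9):
    """Keep positive-utility features not dominated by a correlated one.

    values:  feature name -> feature values over the training sequences
    utility: feature name -> precomputed utility (assumed given)
    """
    positive = [k for k in values if utility[k] > 0]
    selected = []
    for k in positive:
        dominated = any(
            utility[m] > utility[k]
            and abs(pearson(values[k], values[m])) >= r_threshold
            for m in positive
            if m != k
        )
        if not dominated:
            selected.append(k)
    return selected  # an empty list corresponds to the negative result

# Toy feature values and utilities, invented for illustration only.
values = {
    "X1": [1, 2, 3, 4],
    "X2": [2, 4, 6, 8.5],
    "X3": [4, 1, 3, 2],
    "X4": [1, 1, 2, 2],
}
utility = {"X1": 0.3, "X2": 0.5, "X3": 0.2, "X4": -0.1}
selected = select_features(values, utility)
```

In this toy case X1 is dropped because it correlates strongly with X2, which has the higher utility, and X4 is dropped because its utility is not positive. Two correlated features with exactly equal utilities would both be kept, since the rule above only covers strictly different utilities.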