Significant features of the DNA/RNA site sequences

Because the experimental data on functional DNA site activity documented in the ACTIVITY database are considerably limited by the specific experimental conditions, both the training and the control data sets should be constructed by dividing this data set.
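A minimal sketch of such a division, assuming the records are (sequence, activity) pairs and that a random partition is acceptable. The function name `split_dataset`, the 70/30 ratio, the fixed seed, and the toy records are illustrative assumptions and do not come from the ACTIVITY database.

```python
import random


def split_dataset(records, train_fraction=0.7, seed=0):
    """Randomly partition (sequence, activity) records into training and control sets.

    The fraction and the seed are illustrative assumptions, not prescribed values.
    """
    shuffled = list(records)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible partition
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records: (DNA sequence S_n, activity value F_n)
records = [
    ("ACGTAC", 1.0), ("TTGACA", 0.4), ("GGCATA", 2.1), ("CATGCC", 0.9),
    ("AATTGG", 1.5), ("CCGGTA", 0.2), ("TATAAT", 2.4), ("GCGCGC", 0.1),
    ("ATATAT", 1.8), ("CGTACG", 0.7),
]
training, control = split_dataset(records)
```

Every record ends up in exactly one of the two sets, so the control set stays independent of the data used for fitting.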

The training data set is analysed by the following algorithm.

(1) For each DNA sequence S_{n} with activity value F_{n} in the training data set {S_{n}, F_{n}}, all possible __sequence-dependent features__ {X_{k}(S_{n})} are calculated by __exhaustive sorting out of features__.

(2) For each fixed sequence-dependent DNA feature X_{k}, the coefficient pair (f_{0k}, f_{1k}) of the simple regression {F_{k}(S)=f_{0k}+f_{1k}×X_{k}(S)} is optimized.

(3) The predicted and experimental activities {F_{k}(S_{n}), F_{n}} for the feature X_{k} are compared by calculating a quantitative score U(X_{k}, F), called the __utility__.

If no feature has U(X_{k}, F)>0, nothing is selected at this step; this is the negative result of the algorithm. Otherwise, for every pair of linearly correlated features X_{k} and X^{#}_{m} with utilities 0<U(X_{k},F)<U(X^{#}_{m}, F), the feature X_{k} with the lower utility is discarded. This yields linearly independent DNA features {X^{#}_{m}} with the highest positive utilities {U(X^{#}_{m},F)>0} for predicting the activity F; this is the positive result of the algorithm.
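The regression of step (2) and the selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation. The fit is assumed to be ordinary least squares. The utility U is not defined in this section, so the sketch takes the utilities as given input. "Linearly correlated" is assumed to mean that the absolute Pearson coefficient reaches a threshold `r_threshold`, whose value of 0.9 is arbitrary. All function names, feature names and numbers are hypothetical.

```python
import math


def fit_simple_regression(x, f):
    """Least-squares fit of F = f0 + f1 * X; returns (f0, f1)."""
    n = len(x)
    mean_x, mean_f = sum(x) / n, sum(f) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return mean_f, 0.0  # constant feature: no slope can be estimated
    sxf = sum((xi - mean_x) * (fi - mean_f) for xi, fi in zip(x, f))
    f1 = sxf / sxx
    return mean_f - f1 * mean_x, f1


def pearson_r(a, b):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    if saa == 0 or sbb == 0:
        return 0.0
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def select_features(features, utilities, r_threshold=0.9):
    """Keep positive-utility features not correlated with a higher-utility one.

    features:  dict mapping feature name -> values over the training sequences
    utilities: dict mapping feature name -> U(X_k, F), computed elsewhere
    An empty list corresponds to the negative result of the algorithm.
    """
    positive = [k for k in features if utilities[k] > 0]
    kept = []
    for k in positive:
        dominated = any(
            utilities[m] > utilities[k]
            and abs(pearson_r(features[k], features[m])) >= r_threshold
            for m in positive
            if m != k
        )
        if not dominated:
            kept.append(k)
    return kept


# Hypothetical feature values over five training sequences
features = {
    "X_a": [1, 2, 3, 4, 5],
    "X_b": [2, 4, 6, 8, 10],   # perfectly correlated with X_a
    "X_c": [1, -1, 0, 1, -1],
    "X_d": [0, 1, 0, 1, 0],
}
# Hypothetical utilities, standing in for step (3)
utilities = {"X_a": 0.3, "X_b": 0.5, "X_c": 0.2, "X_d": -0.1}

f0, f1 = fit_simple_regression(features["X_a"], [3, 5, 7, 9, 11])
kept = select_features(features, utilities)
```

In this toy input, X_a is dropped because it correlates with the higher-utility X_b, and X_d is dropped because its utility is not positive. The pairwise rule is applied exactly as worded in the text, so a feature is discarded whenever any correlated feature has a strictly higher utility.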