Significant features of the DNA/RNA site sequences

Since the experimental data on functional DNA site activity stored in the ACTIVITY database depend strongly on the specific experimental conditions under which they were obtained, both the training and the control data sets should be constructed by dividing one and the same data set.
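The division into training and control sets can be sketched as follows. This is a minimal illustration only: the source does not state how the data set is divided, so the random split, the `train_fraction` parameter, the function name `split_dataset`, and the example sequences and activities are all assumptions.

```python
import random


def split_dataset(pairs, train_fraction=0.5, seed=0):
    """Randomly divide (sequence, activity) pairs into training and control sets.

    Assumption: a seeded random split; the source does not specify the procedure.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical {Sn, Fn} data measured under one set of experimental conditions.
data = [("ACGTAC", 1.2), ("TTGACA", 0.4), ("TATAAT", 2.1), ("GGCCAA", 0.9)]
training, control = split_dataset(data)
```

Every pair ends up in exactly one of the two sets, so the control set stays independent of the data used for training.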

The training data set is analysed by the following algorithm.

(1) For each DNA sequence Sn with activity value Fn in the training data set {Sn, Fn}, all possible sequence-dependent features {Xk(Sn)} are calculated by exhaustive enumeration of the features.

(2) For each fixed sequence-dependent DNA feature Xk, the coefficient pair (f0k, f1k) of the simple regression {Fk(S)=f0k+f1k×Xk(S)} is optimized.

(3) The predicted and experimental activities {Fk(Sn), Fn} for the feature Xk are compared by calculating a quantitative score U(Xk, F), called the utility.
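The regression step for a single feature can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the source does not say how the coefficient pair is optimized, so ordinary least squares is assumed here, and the feature `count_a` (number of adenines in the sequence) and the example data are hypothetical.

```python
def count_a(sequence):
    """Hypothetical sequence-dependent feature Xk(S): number of 'A' in S."""
    return sequence.count("A")


def fit_simple_regression(x, f):
    """Fit F(S) = f0 + f1 * X(S) by ordinary least squares (an assumption)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_f = sum(f) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("feature is constant over the training set")
    sxf = sum((xi - mean_x) * (fi - mean_f) for xi, fi in zip(x, f))
    f1 = sxf / sxx
    f0 = mean_f - f1 * mean_x
    return f0, f1


# Hypothetical training set {Sn, Fn}.
sequences = ["CCCC", "ACCC", "AACC", "AAAC"]
activities = [1.0, 3.0, 5.0, 7.0]

x = [count_a(s) for s in sequences]               # Xk(Sn)
f0, f1 = fit_simple_regression(x, activities)     # coefficient pair (f0k, f1k)
predicted = [f0 + f1 * xi for xi in x]            # Fk(Sn), to be compared with Fn
```

The list `predicted` holds the values Fk(Sn) that are compared with the experimental activities Fn.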

If no feature has U(Xk, F)>0, nothing is selected at this step, which is the negative result of the algorithm. Otherwise, for every pair of features Xk and X#m that are linearly correlated with each other and have utilities 0<U(Xk, F)<U(X#m, F), the feature Xk with the lower utility is discarded. This yields linearly independent DNA features {X#m} with the highest positive utilities {U(X#m, F)>0} for predicting the activity F, which is the positive result of the algorithm.
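The selection rule can be sketched as below, taking the utilities as given. Several parts are assumptions rather than details from the source: "linearly correlating" is read as an absolute Pearson correlation at or above a hypothetical threshold `r_threshold`, and the feature names, feature values and utility values are invented for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def select_features(values, utility, r_threshold=0.9):
    """Keep positive-utility features not dominated by a correlated one.

    values:  feature name -> list of Xk(Sn) over the training set
    utility: feature name -> U(Xk, F)
    An empty list is the negative result of the algorithm.
    """
    positive = [k for k in values if utility[k] > 0]
    selected = []
    for k in positive:
        dominated = any(
            m != k
            and utility[m] > utility[k]
            and abs(pearson(values[k], values[m])) >= r_threshold
            for m in positive
        )
        if not dominated:
            selected.append(k)
    return selected


# Hypothetical feature values and utilities.
values = {
    "X1": [1, 2, 3, 4],
    "X2": [2, 4, 6, 8.5],   # nearly proportional to X1
    "X3": [4, 1, 3, 2],
    "X4": [1, 1, 2, 2],
}
utility = {"X1": 0.3, "X2": 0.7, "X3": 0.5, "X4": -0.2}
selected = select_features(values, utility)
```

In this example X4 is excluded for its non-positive utility and X1 is discarded because it correlates with X2, which has the higher utility, leaving X2 and X3.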