CRASP
Pairwise positional correlation analysis: help information and parameter descriptions
This part of the CRASP package estimates the degree of pairwise dependence between residue substitutions, expressed as linear correlation coefficients between physico-chemical property values at protein positions.
Calculation parameters
At present, the CRASP package accepts aligned sequences in FASTA format only. Sequences should be written in the standard 20-letter amino acid code, with the symbol '-' for gaps. Allowed symbols are: ARNDCQEGHILKMFPSTWYV-. Both upper-case and lower-case letters are accepted. Sequences should be aligned. If the sequences differ in length, the alignment length is set to the length of the shortest sequence and all other sequences are truncated to that length. Examples of input sequence data for several protein families are presented here.
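The input rules above can be illustrated with a minimal sketch. The function name read_alignment is hypothetical and is not part of CRASP; raising an error on an unsupported symbol is also an assumption, since this page does not say how CRASP reacts to such symbols.

```python
ALLOWED = set("ARNDCQEGHILKMFPSTWYV-")

def read_alignment(fasta_text):
    """Parse FASTA text into (ids, sequences).

    Letters are upper-cased and all sequences are cut to the
    length of the shortest one, as described above.
    """
    ids, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            ids.append(header[0] if header else "")
            seqs.append("")
        elif seqs:
            seqs[-1] += line.upper()
    for name, seq in zip(ids, seqs):
        bad = set(seq) - ALLOWED
        if bad:
            # Assumption: reject the input rather than skip the symbol.
            raise ValueError(f"{name}: unsupported symbols {sorted(bad)}")
    length = min(len(s) for s in seqs) if seqs else 0
    return ids, [s[:length] for s in seqs]

example = ">seq1\nARNDc\n>seq2\narn-CQE\n"
ids, seqs = read_alignment(example)
print(ids, seqs)  # ['seq1', 'seq2'] ['ARNDC', 'ARN-C']
```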
Amino acid physico-chemical characteristics
These characteristics of amino acids reflect physical and chemical interactions between residues. At present, the CRASP package contains data on 36 physico-chemical scales. The user selects one of these characteristics at this step of the analysis.
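As an illustration of how a scale turns an alignment column into numbers, the sketch below uses the Kyte-Doolittle hydropathy scale. Whether this scale is among the 36 scales offered by CRASP is not stated on this page; it is used here only as a well-known example, and the function name column_values is hypothetical.

```python
# Kyte-Doolittle hydropathy values (illustrative scale only).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def column_values(seqs, position, scale=KYTE_DOOLITTLE):
    """Return the property value of each sequence at a 0-based column.

    Gaps ('-') have no property value and are returned as None.
    """
    return [scale.get(s[position].upper()) for s in seqs]

alignment = ["AR-", "IKD"]
print(column_values(alignment, 0))  # [1.8, 4.5]
print(column_values(alignment, 2))  # [None, -3.5]
```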
This field contains the ordinal number of an amino acid property from the AAINDEX database. The value can range from 1 to 434 (see the list of indices here). If the value is set to zero, the amino acid physico-chemical characteristic is taken from the menu (see above).
AAINDEX database reference: Tomii, K. and Kanehisa, M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27-36. Internet: http://www.genome.ad.jp/dbget/aaindex.html.
This parameter selects the matrix to be calculated. Three types of matrix can be calculated:
Covariation: covariation matrix calculation
Linear correlation (default): linear correlation coefficients calculation
Partial correlation: partial correlation coefficients calculation
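The three quantities can be sketched for property values at individual positions as follows. This is an illustration under stated assumptions, not the CRASP implementation: the covariance uses the n - 1 denominator, and the partial correlation shown is the first-order form that controls for a single third position z, whereas CRASP may condition on a different set of positions. All function names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (n - 1 denominator, an assumption)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson linear correlation coefficient."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = correlation(x, y), correlation(x, z), correlation(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Property values at three alignment positions over five sequences.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
z = [5.0, 3.0, 4.0, 1.0, 2.0]
print(covariance(x, y), correlation(x, y), partial_correlation(x, y, z))
```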
If substitutions at conserved protein positions are rare, no statistically significant relationship between them can be detected. Such positions may therefore be excluded from the analysis. The threshold for position variability is expressed as the minimal number of different amino acid types in the corresponding alignment column. Default value is 3.
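A minimal sketch of the variability filter, assuming that the gap symbol is not counted as an amino acid type (this page does not say how CRASP counts gaps here). The function name variable_positions is hypothetical.

```python
def variable_positions(seqs, min_types=3):
    """Return 0-based columns with at least min_types distinct residues."""
    keep = []
    for i in range(len(seqs[0])):
        types = {s[i] for s in seqs}
        types.discard("-")  # assumption: a gap is not an amino acid type
        if len(types) >= min_types:
            keep.append(i)
    return keep

alignment = ["ARND", "ARNE", "AKQD", "AKSE"]
print(variable_positions(alignment))  # [2]
```

With the default threshold of 3, only the third column (N, N, Q, S) is kept; lowering the threshold to 2 also keeps the second and fourth columns.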
Gaps in the alignment are treated as missing data. This treatment seems reasonable for positions with a low percentage of deletions. In the CRASP package, missing data are replaced by the mean value of the physico-chemical property at the corresponding position. The threshold is expressed as the maximal percentage of gaps in an alignment column. Default value is 0 (no gaps are allowed).
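The gap treatment can be sketched as follows. Gaps are represented as None, the function name impute_column is hypothetical, and treating the threshold as inclusive (a column with exactly the maximal percentage of gaps is kept) is an assumption.

```python
def impute_column(values, max_gap_percent=0.0):
    """Replace gaps (None) by the column mean.

    Returns None if the column has more gaps than allowed.
    """
    gaps = sum(v is None for v in values)
    observed = [v for v in values if v is not None]
    if not observed or 100.0 * gaps / len(values) > max_gap_percent:
        return None  # column excluded from the analysis
    column_mean = sum(observed) / len(observed)
    return [column_mean if v is None else v for v in values]

column = [1.0, None, 3.0, 5.0]
print(impute_column(column))        # None: the default allows no gaps
print(impute_column(column, 25.0))  # [1.0, 3.0, 3.0, 5.0]
```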
This parameter is introduced for convenience in presenting the results. On the result page, alignment positions are denoted as A-1 L-2 R-3 ... Y-100 ... (amino acid type - position ordinal number). This parameter defines the reference sequence from which the amino acid types at the alignment positions are taken. Default value is 1 (amino acid denotations are taken from the first sequence in the alignment).
Number of the first amino acid
This parameter is introduced for convenience in presenting the results. It sets the ordinal number of the first amino acid in the reference sequence. This value is not always equal to 1. For example, when the alignment represents a protein domain or another part of a complete protein sequence, the user can set the number of the first position of the domain. Default value is 1 (the reference sequence starts from the first amino acid of the complete sequence).
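The two display parameters can be sketched together. The function name position_labels is hypothetical, and numbering every alignment column consecutively (including columns where the reference sequence has a gap) is an assumption.

```python
def position_labels(seqs, reference=1, first_number=1):
    """Build labels such as 'A-1' from the reference sequence.

    reference is the 1-based number of the reference sequence;
    first_number is the number given to its first position.
    """
    ref = seqs[reference - 1]
    return [f"{aa}-{first_number + i}" for i, aa in enumerate(ref)]

alignment = ["ALR", "GLK"]
print(position_labels(alignment))                                 # ['A-1', 'L-2', 'R-3']
print(position_labels(alignment, reference=2, first_number=100))  # ['G-100', 'L-101', 'K-102']
```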
Weighting sequence data
It is known that over-representation of some homologous sequences in the sample may bias statistical estimates. To avoid such biases, different schemes of sequence weighting have been proposed. These approaches reduce the weights of over-represented sequences and assume that the "true" distribution of sequences in the sequence space is homogeneous. The CRASP package supports several schemes of data weighting. The scheme is selected by the Weighting type parameter.
Four types of weighting are allowed:
OFF (default): all sequence weights are equal to 1
Felsenstein: the method suggested by Felsenstein (1985); its calculation is based on evolutionary tree data. If these data are available, this weighting scheme is recommended
Vingron & Argos: the method suggested by Vingron and Argos (1985)
User defined: the weight coefficients are entered by the user
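To show how sequence weights can enter the statistics, the sketch below computes a weighted Pearson correlation. This is one common way of using weights and is given only as an illustration; this page does not specify the exact formula CRASP uses, and the function name weighted_correlation is hypothetical.

```python
from math import sqrt

def weighted_correlation(x, y, w):
    """Pearson correlation with one non-negative weight per sequence."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    cxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    cyy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    return cxy / sqrt(cxx * cyy)

# The last two sequences are duplicates; giving each a weight of 0.5
# makes them count as a single sequence.
x = [1.0, 2.0, 3.0, 3.0]
y = [1.0, 3.0, 2.0, 2.0]
print(weighted_correlation(x, y, [1, 1, 1, 1]))
print(weighted_correlation(x, y, [1, 1, 0.5, 0.5]))
```

In this example the down-weighted result equals the unweighted correlation of the three distinct sequences, which is the intended effect of reducing the influence of over-represented sequences.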
For Felsenstein weighting. In this field, the user should enter the phylogenetic tree in (*.ph) format. This format is supported by many tree-inferring packages, such as CLUSTALW, Phylip, Treecon, etc. If you use a tree, make sure that the sequence identifiers in the sequence data correspond to those in the tree data. In case of a mismatch, the CRASP program exits with an error. However, the CRASP package allows the sequence ID in the sequence data to be longer than in the tree data, for example, AP1_CHICK_156 (sequence input) and AP1_CHICK (tree input). One can see an example of input data for this weighting scheme.
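The identifier rule can be sketched as a prefix match. This is an assumption about how the longer sequence ID is reconciled with the shorter tree ID; the actual matching rule used by CRASP is not described on this page, and the function name match_tree_ids is hypothetical.

```python
def match_tree_ids(sequence_ids, tree_ids):
    """Map each sequence ID to the tree ID that is its longest prefix.

    Raises ValueError on a mismatch, mirroring the documented error exit.
    """
    mapping = {}
    for sid in sequence_ids:
        candidates = [t for t in tree_ids if sid.startswith(t)]
        if not candidates:
            raise ValueError(f"no tree leaf matches sequence ID {sid!r}")
        mapping[sid] = max(candidates, key=len)
    return mapping

print(match_tree_ids(["AP1_CHICK_156"], ["AP1_CHICK"]))
# {'AP1_CHICK_156': 'AP1_CHICK'}
```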
For user-defined weighting. In this field, the weight for each sequence should be given on a separate line (default format). However, the weights can also be entered in a free format. In that case, the user should define the separator symbols (several symbols are allowed, for example: ;,: ). These symbols should be specified in the Separator field.
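Both input formats can be handled by one small parser, sketched below. The function name parse_weights is hypothetical; treating whitespace and line breaks as separators in addition to the user-defined symbols is an assumption.

```python
import re

def parse_weights(text, separators=""):
    """Split weight text on line breaks, whitespace, and any extra separators."""
    pattern = "[" + re.escape(separators) + r"\s]+"
    tokens = [t for t in re.split(pattern, text.strip()) if t]
    return [float(t) for t in tokens]

# Default format: one weight per line.
print(parse_weights("0.5\n1.0\n0.25"))             # [0.5, 1.0, 0.25]
# Free format with the separator symbols ;,:
print(parse_weights("0.5; 1.0,0.25:2", ";,:"))     # [0.5, 1.0, 0.25, 2.0]
```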
Output parameters
The calculated matrix can be represented in one of the following forms:
ASCII text file (default): ASCII text file (with HTML header)
HTML table: the matrix as an HTML table
Matrix color diagram: color diagram of the matrix elements in GIF format
Significant pairs: color diagram of statistically significant correlation coefficients in GIF format (not defined for the covariation matrix)
This parameter sets the threshold for displaying highly correlated pairs of protein positions.
Optional parameters
Clustering highly correlated positions
This parameter allows the user to view correlation coefficients for positions that form significant clusters. If this check-box is selected, the correlation matrix is also shown in an additional diagram in rearranged form, so that clustered positions are placed close to each other in the matrix.
This parameter specifies the cut-off value of the correlation coefficient used to separate clusters of correlated positions. If the value is 0.0, the program uses the critical value of the correlation coefficient as the cut-off value.
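One simple way to separate clusters with a cut-off is sketched below: positions are linked when the absolute value of their correlation reaches the cut-off, and each connected group forms a cluster. The clustering algorithm actually used by CRASP is not described on this page, so this grouping rule, the use of the absolute value, and the function name clusters are all assumptions.

```python
def clusters(corr, cutoff):
    """Group positions linked by |correlation| >= cutoff (connected components)."""
    n = len(corr)
    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and abs(corr[i][j]) >= cutoff:
                    seen.add(j)
                    stack.append(j)
        result.append(sorted(component))
    return result

corr = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, -0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, -0.8, 0.1, 1.0],
]
groups = clusters(corr, 0.5)
order = [i for group in groups for i in group]
print(groups, order)  # [[0, 2], [1, 3]] [0, 2, 1, 3]
```

Reordering the rows and columns of the matrix by order places the members of each cluster next to each other, which is the effect of the rearranged diagram described above.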
Detecting matrix regions with a high density of correlated pairs
This parameter specifies the window size used to locate regions with a high density of high correlation values. If it is set to 0.0, no output is produced.
This parameter specifies the threshold above which a correlation coefficient is considered significant. If the value is set to 0.0, the program uses the critical value of the correlation coefficient as the threshold.
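For orientation, a critical value of the correlation coefficient can be approximated with the Fisher z-transformation, as sketched below. This is an approximation chosen because it needs only the Python standard library; this page does not state which test or significance level CRASP uses, so the function name approx_critical_r and the default alpha of 0.05 are assumptions.

```python
from math import sqrt, tanh
from statistics import NormalDist

def approx_critical_r(n_sequences, alpha=0.05):
    """Approximate two-sided critical |r| for a sample of n_sequences.

    Uses the Fisher z-transformation; requires n_sequences > 3.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return tanh(z / sqrt(n_sequences - 3))

for n in (50, 103, 200):
    print(n, round(approx_critical_r(n), 3))
```

The critical value decreases as the number of sequences grows, so weaker correlations become significant in larger alignments.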
Choose the sign of significant correlations to take into account. Select positive, negative, or both types of correlations.
High correlation density significance level
This value specifies the significance level for the occurrence of highly correlated pairs in a region of the correlation matrix, i.e., the significance level at which the density of high correlations exceeds the value expected at random.
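The idea of comparing a window's density with the random expectation can be sketched with a binomial tail probability. This is an illustrative model only: the square window, the background fraction of significant pairs, and the function names window_counts and binomial_tail are assumptions, and the test actually used by CRASP is not described on this page.

```python
from math import comb

def window_counts(significant, row, col, size):
    """Count significant pairs in a size x size window with top-left (row, col)."""
    cells = [significant[i][j]
             for i in range(row, row + size)
             for j in range(col, col + size)]
    return sum(cells), len(cells)

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# True marks a significantly correlated pair of positions.
T, F = True, False
significant = [
    [F, F, T, T],
    [F, F, T, T],
    [T, T, F, F],
    [T, T, F, F],
]
hits, cells = window_counts(significant, 0, 2, 2)
background = 0.25  # hypothetical fraction of significant pairs expected at random
p_value = binomial_tail(hits, cells, background)
print(hits, cells, p_value)  # 4 4 0.00390625
```

A window is reported as a high-density region when this probability falls below the chosen significance level.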