Pairwise correlation analysis Help Info

CRASP

Pairwise positional correlation analysis

Help information and parameters description

This part of the CRASP package estimates the degree of pairwise dependence between residue substitutions, expressed as linear correlation coefficients between physico-chemical property values at protein positions.

Calculation parameters

Sequence data format

At present, the CRASP package accepts aligned sequences in FASTA format only. Sequences should be written in the standard 20-letter amino acid code, with the symbol '-' for gaps. Allowed symbols are: ARNDCQEGHILKMFPSTWYV-. Both upper-case and lower-case letters are accepted. Sequences should be aligned. If the sequences are of unequal length, the alignment length is set to the length of the shortest sequence and all other sequences are truncated to that length. Examples of input sequence data for several protein families are presented here.
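The input rules above can be sketched in Python as follows. This is a minimal illustration, not CRASP's own parser: the function name `read_alignment` and the choice to raise an error on unsupported symbols are assumptions.

```python
ALLOWED = set("ARNDCQEGHILKMFPSTWYV-")  # symbols listed in the help text

def read_alignment(fasta_text):
    """Parse FASTA text into (ids, sequences).

    Letters are upper-cased, unsupported symbols raise ValueError
    (an assumption), and all sequences are cut to the shortest length.
    """
    ids, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            ids.append(header[0] if header else "")
            seqs.append("")
        elif seqs:
            seqs[-1] += line.upper()
    for name, seq in zip(ids, seqs):
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(f"{name}: unsupported symbols {sorted(bad)}")
    if not seqs:
        return ids, seqs
    length = min(len(s) for s in seqs)
    return ids, [s[:length] for s in seqs]
```

For example, `read_alignment(">a\nARND\n>b\narn-cq\n")` yields the sequences `ARND` and `ARN-`, because the second sequence is cut to four columns.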

Amino acid physico-chemical characteristics

These characteristics of amino acids reflect physical and chemical interactions between residues. At present, the CRASP package contains data on 36 physico-chemical scales. The user selects one of these characteristics at this step of the analysis.
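To illustrate how a scale turns an alignment column into numbers, the sketch below uses the Kyte-Doolittle hydropathy values. This scale is chosen only as a familiar example; whether it is among the 36 scales offered by CRASP is not stated here, and the function name `column_values` is hypothetical.

```python
# Kyte-Doolittle hydropathy index (illustrative scale only).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def column_values(sequences, column, scale=KYTE_DOOLITTLE):
    """Return the property value of each sequence at one alignment column.

    Gaps ('-') have no property value and are returned as None.
    """
    return [scale.get(seq[column]) for seq in sequences]
```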

AAindex number

This field contains the ordinal number of an amino acid property in the AAINDEX database. The value can range from 1 to 434 (see the list of indices here).
If this value is set to zero, the amino acid physico-chemical characteristic is taken from the menu (see above).
AAINDEX database reference: Tomii, K. and Kanehisa, M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27-36. Internet: http://www.genome.ad.jp/dbget/aaindex.html.

Type of a matrix

This parameter selects the matrix to be calculated. Three types of matrix can be calculated.

Covariation	Covariation matrix calculation
Linear correlation (default)	Linear correlation coefficients calculation
Partial correlation	Partial correlation coefficients calculation
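The three quantities in the table can be sketched for property-value columns as below. The covariance and linear correlation follow the standard sample definitions. For partial correlation, the sketch uses the textbook first-order formula that controls for a single third position; how many positions CRASP conditions on is not described here, so that part is an assumption, as are the function names.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson linear correlation coefficient."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z
    (first-order formula; an assumption about the conditioning set)."""
    rxy, rxz, ryz = correlation(x, y), correlation(x, z), correlation(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

Columns with zero variance would cause a division by zero here, which is one practical reason to exclude conserved positions before the calculation.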

Variability threshold value

If substitutions at conserved protein positions are rare, no statistically significant relationship between them can be detected. Such positions may therefore be excluded from the analysis. The position variability threshold is expressed as the minimal number of different amino acid types in the corresponding alignment column. The default value is 3.
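A minimal sketch of this filter is shown below. The function name `variable_columns` is hypothetical, and the decision not to count the gap symbol as an amino acid type is an assumption.

```python
def variable_columns(sequences, min_types=3):
    """Return indices of alignment columns that contain at least
    `min_types` different amino acid types (gaps are not counted)."""
    keep = []
    for col in range(len(sequences[0])):
        types = {seq[col] for seq in sequences} - {"-"}
        if len(types) >= min_types:
            keep.append(col)
    return keep
```

With the default threshold, a fully conserved column or a column with only two residue types is dropped before any matrix is calculated.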

Gap number threshold

Gaps in the alignment are treated as "missing data". This treatment seems reasonable for positions with a low percentage of deletions. In the CRASP package, these missing values are replaced by the mean value of the physico-chemical property at the corresponding position. The threshold is expressed as the maximal percentage of gaps in an alignment column. The default value is 0 (no gaps are allowed).
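The gap treatment described above can be sketched as follows, with gaps represented as `None` in a column of property values. The function name `impute_column` and the convention of returning `None` for a rejected column are assumptions.

```python
def impute_column(values, max_gap_percent=0.0):
    """Replace gaps (None) by the column mean of the observed values.

    Returns None when the percentage of gaps exceeds the threshold,
    meaning the column is excluded from the analysis.
    """
    gaps = sum(v is None for v in values)
    if 100.0 * gaps / len(values) > max_gap_percent:
        return None
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # nothing to average over
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

For a column `[1.0, None, 3.0, 2.0]` and a threshold of 25 percent, the gap is replaced by the mean 2.0; with the default threshold of 0 the column is rejected.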

Selected sequence number

This parameter is introduced for convenient presentation of the results. On the result page, alignment positions are denoted as
A-1 L-2 R-3 ... Y-100 ... (amino acid type - position ordinal number)
This parameter defines the reference sequence that supplies the amino acid types at the alignment positions. The default value is 1 (amino acid designations are taken from the first sequence in the alignment).

Number of the first amino acid

This parameter is introduced for convenient presentation of the results. It sets the ordinal number of the first amino acid in the reference sequence. This value is not always equal to 1. For example, when the alignment represents a protein domain or another part of a complete protein sequence, the user can set the number of the first position of the protein domain. The default value is 1 (the reference sequence starts from the first amino acid of the complete sequence).
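The two presentation parameters, the selected sequence number and the number of the first amino acid, combine into position labels roughly as sketched below. The function name `position_labels` is hypothetical, and the sketch simply numbers alignment columns consecutively; how CRASP numbers columns where the reference sequence has a gap is not stated here.

```python
def position_labels(sequences, reference=1, first_number=1):
    """Build labels such as 'A-1', 'L-2' from the reference sequence.

    `reference` is the 1-based number of the reference sequence and
    `first_number` is the ordinal number given to its first column.
    """
    ref = sequences[reference - 1]
    return [f"{aa}-{first_number + i}" for i, aa in enumerate(ref)]
```

For a domain alignment that starts at residue 100 of the complete protein, `position_labels(seqs, reference=2, first_number=100)` labels the first column with the residue of the second sequence followed by `-100`.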

Weighting sequence data

Over-representation of some homologous sequences in the sample may bias statistical estimates. To avoid such biases, different sequence weighting schemes have been proposed. These approaches reduce the weights of over-represented sequences and assume that the "true" distribution of sequences in sequence space is homogeneous. The CRASP software package can apply different data weighting schemes. The option is controlled by the Weighting type parameter.

Weighting type

Four types of weighting are allowed:

OFF (default)	All the sequence weights are equal to 1
Felsenstein	The method was suggested by Felsenstein (1985); its calculation is based on evolutionary tree data. If these data are available, this weighting scheme is recommended
Vingron & Argos	The method suggested by Vingron and Argos (1989)
User defined	The weight coefficients are introduced by the user
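Whatever scheme supplies the weights, they enter the statistics by replacing plain averages with weighted ones. The sketch below shows a standard weighted Pearson correlation; it is an illustration of the general idea, not CRASP's documented formula, and the function name is hypothetical.

```python
from math import sqrt

def weighted_correlation(x, y, w):
    """Pearson correlation of x and y with per-sequence weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    return cov / sqrt(vx * vy)
```

With all weights equal to 1 (the OFF setting) this reduces to the ordinary correlation coefficient. Two identical sequences with weight 0.5 each contribute exactly as much as one sequence with weight 1, which is the effect weighting is meant to achieve.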

Weight data input

For Felsenstein weighting. In this field, the user should input the phylogenetic tree in (*.ph) format. This format is supported by many tree-inferring packages, such as CLUSTALW, Phylip, Treecon, etc. If you use a tree, make sure that the sequence identifiers in the sequence data correspond to those in the tree data. In case of a mismatch, the CRASP program exits with an error. However, the CRASP package allows a sequence ID in the sequence data to be longer than in the tree data, for example, AP1_CHICK_156 (sequence input) and AP1_CHICK (tree input). One can see an example of input data for this weighting scheme.
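The identifier rule can be checked before submission with a sketch like the one below. It assumes that "longer" means the tree ID is a prefix of the sequence ID, as in the AP1_CHICK example; the exact matching rule used by CRASP is not given here, and the function name is hypothetical.

```python
def match_tree_ids(sequence_ids, tree_ids):
    """Map each sequence ID to the tree ID it starts with.

    Raises ValueError when a sequence has no counterpart in the tree,
    mirroring the error exit described above. Prefix matching is an
    assumption based on the AP1_CHICK_156 / AP1_CHICK example.
    """
    mapping = {}
    for sid in sequence_ids:
        hits = [tid for tid in tree_ids if sid.startswith(tid)]
        if not hits:
            raise ValueError(f"no tree entry for sequence {sid!r}")
        mapping[sid] = max(hits, key=len)  # prefer the longest match
    return mapping
```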

For user-defined weighting. In this field, the weight of each sequence should be given on a separate line (default format). The weights can also be entered in a free format. In that case, the user should define the separator symbols (several symbols are allowed, for example: ;,: ). These symbols should be specified in the Separator field.
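Both input styles for user-defined weights can be read with a sketch like this one. The function name `parse_weights` is hypothetical, and treating whitespace as a separator alongside the user-supplied symbols is an assumption.

```python
import re

def parse_weights(text, separators=None):
    """Parse weights given one per line (default) or in free format.

    `separators` is a string of separator symbols such as ';,:'.
    Whitespace always separates values as well (an assumption).
    """
    if separators:
        pattern = "[" + re.escape(separators) + r"\s]+"
        tokens = re.split(pattern, text.strip())
    else:
        tokens = text.split()
    return [float(t) for t in tokens if t]
```

For example, `parse_weights("1.0; 0.5,0.25:2", ";,:")` and `parse_weights("1.0\n0.5\n0.25\n2\n")` give the same four weights.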

Output parameters

Type of output

The calculated matrix can be presented in one of the following forms:

ASCII text file (default)	ASCII text file (with HTML header)
HTML table	HTML-table
Matrix color diagram	Color diagram for matrix elements in GIF-format
Significant pairs	Color diagram for statistically significant correlation coefficients in GIF-format (not defined for covariation matrix)

Significance level

This parameter sets the threshold for reporting highly correlated pairs of protein positions.

Optional parameters

Clustering highly correlated positions

Show rearranged matrix

This parameter displays the correlation coefficients for positions that form significant clusters. If this check-box is ticked, the correlation matrix is shown in an additional diagram in rearranged form, so that clustered positions are placed close together in the matrix.

Clustering cut-off value

This parameter specifies the cut-off value of the correlation coefficient used to separate clusters of correlated positions. If the value is 0.0, the program uses the critical value of the correlation coefficient as the cut-off.
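One simple way to separate clusters with a cut-off is to link every pair of positions whose absolute correlation reaches the cut-off and take the connected groups. The sketch below does exactly that. The clustering algorithm actually used by CRASP is not described here, so this single-linkage reading, the use of the absolute value, and the function name are all assumptions.

```python
def cluster_positions(corr, cutoff):
    """Group positions linked by |correlation| >= cutoff.

    `corr` is a square matrix given as a list of lists. Returns a list
    of clusters (sorted index lists); unlinked positions are omitted.
    """
    n = len(corr)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if j not in seen and abs(corr[i][j]) >= cutoff:
                    seen.add(j)
                    stack.append(j)
        if len(members) > 1:
            clusters.append(sorted(members))
    return clusters
```

Listing the positions cluster by cluster gives one possible row and column order for a rearranged matrix in which clustered positions sit next to each other.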

Detecting matrix regions with a high density of correlated pairs

(See some theory here).

Window size

This parameter specifies the window size used to locate regions with a high density of high correlation values. If it is set to 0.0, no output is produced.

Correlation threshold

This parameter specifies the threshold above which a correlation coefficient is considered significant. If the value is set to 0.0, the program uses the critical value of the correlation coefficient as the threshold.

Correlation sign

Choose the sign of the significant correlations to take into account. Select positive, negative, or both types of correlations.

High correlation density significance level

This value specifies the significance level for the occurrence of highly correlated pairs in a region of the correlation matrix, i.e., the level at which the density of high correlations is judged to exceed the value expected at random.
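The four parameters of this section (window size, correlation threshold, correlation sign, and significance level) can be tied together in a rough sketch. It counts significant pairs among the positions of one window and estimates how likely such a count is by chance with a binomial model. Both the window shape and the binomial model are assumptions made for illustration; the theory actually used by CRASP is on the page referenced above and is not reproduced here. All function names are hypothetical.

```python
from math import comb

def is_significant(r, threshold, sign="both"):
    """Apply the correlation threshold with the chosen sign."""
    if sign == "positive":
        return r >= threshold
    if sign == "negative":
        return r <= -threshold
    return abs(r) >= threshold

def window_count(corr, start, size, threshold, sign="both"):
    """Return (significant pairs, all pairs) among positions
    start .. start+size-1 of the correlation matrix."""
    stop = min(start + size, len(corr))
    hits = total = 0
    for i in range(start, stop):
        for j in range(i + 1, stop):
            total += 1
            hits += is_significant(corr[i][j], threshold, sign)
    return hits, total

def binomial_tail(k, n, p):
    """Probability of at least k successes in n trials with rate p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Under these assumptions, a window would be reported when `binomial_tail(hits, total, p)` falls below the chosen significance level, where `p` is the fraction of significant pairs in the whole matrix.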
