CRASP

Integral linear characteristics analysis

Help information and parameters description

This part of the CRASP package estimates the constancy of a given physico-chemical characteristic of a protein. The characteristic is defined as a linear combination of physico-chemical amino acid property values at protein positions.

Calculation parameters

Sequence data format

At present the CRASP package accepts aligned sequences in FASTA format only. Sequences should be written in the standard 20-letter amino acid code, with the symbol '-' for gaps. Allowed symbols are: ARNDCQEGHILKMFPSTWYV-. Both upper-case and lower-case letters are accepted. Sequences should be aligned. If the sequences are not aligned (that is, their lengths differ), the alignment length is set to the length of the shortest sequence and all other sequences are cut to that length. Examples of input sequence data for several protein families are presented here.
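The input rules above can be sketched in Python as follows. This is an illustrative reader, not CRASP's own parser: the function name `read_alignment` is hypothetical, and taking the first word of the header as the sequence ID is an assumption. The truncation step mirrors the behaviour described for sequences of unequal length.

```python
ALLOWED = set("ARNDCQEGHILKMFPSTWYV-")

def read_alignment(text):
    """Parse FASTA text into (ids, sequences) and validate the symbols.

    Sequences of unequal length are cut to the shortest one, as the
    help text describes for unaligned input.
    """
    ids, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word is taken as the sequence ID (assumption).
            header = line[1:].split()
            ids.append(header[0] if header else "")
            seqs.append("")
        elif ids:
            seqs[-1] += line.upper()  # lower-case letters are accepted
        else:
            raise ValueError("sequence data found before the first FASTA header")
    for name, seq in zip(ids, seqs):
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(f"{name}: unsupported symbols {sorted(bad)}")
    if not seqs:
        return ids, seqs
    length = min(len(s) for s in seqs)
    return ids, [s[:length] for s in seqs]

example = """>seq1
ARND-CQE
>seq2
arndgcq
"""
ids, seqs = read_alignment(example)
print(ids, seqs)
```

In this example the second sequence is one residue shorter, so the first one loses its last symbol.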

Amino acid physico-chemical characteristics

These characteristics of amino acids reflect physical and chemical interactions between residues. At present the CRASP package contains data on 36 physico-chemical scales. The user selects one of these characteristics at this step of the analysis.

AAindex number

This field contains the ordinal number of an amino acid property from the AAINDEX database. The value can range from 1 to 434 (see list of indices here).
If this value is zero, the amino acid physico-chemical characteristic selected from the menu (see above) is used.
AAINDEX database reference: Tomii, K. and Kanehisa, M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27-36. Internet: http://www.genome.ad.jp/dbget/aaindex.html.

Number of randomized samples

This parameter sets the number of simulated randomized samples used to estimate the conservation of a particular integral characteristic. For characteristics that include many positions and for very large sequence samples the calculations are time-consuming, so we recommend using small values for this parameter.

Weighting sequence data

Over-representation of some homologous sequences in a sample is known to bias statistical estimates. To avoid such biases, various sequence weighting schemes have been proposed. These approaches reduce the weights of over-represented sequences and assume that the "true" distribution of sequences in sequence space is homogeneous. The CRASP package supports several data weighting schemes. The choice is controlled by the Weighting type parameter.

Weighting type

Four types of weighting are available:

OFF   (default)

All the sequence weights are equal to 1

Felsenstein

The method was suggested by Felsenstein (1985); the weights are calculated from evolutionary tree data. If these data are available, this weighting scheme is recommended.

Vingron & Argos

The method suggested by Vingron and Argos (1985)

User defined

The weight coefficients are supplied by the user
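As a rough illustration of how weighting reduces the influence of over-represented sequences, the sketch below gives each sequence a weight proportional to its summed distance to all other sequences, which is how the Vingron and Argos scheme is commonly summarized. The distance measure (fraction of mismatched alignment columns), the rescaling to a mean weight of 1, and the function names are assumptions made for this example; the CRASP implementation may differ.

```python
def mismatch_fraction(a, b):
    """Fraction of alignment columns in which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def summed_distance_weights(seqs):
    """Weight each sequence by its total distance to all the others.

    Weights are rescaled so that their mean is 1, which keeps them
    comparable with the OFF mode (all weights equal to 1).
    """
    n = len(seqs)
    totals = [
        sum(mismatch_fraction(seqs[i], seqs[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    total = sum(totals)
    if total == 0:
        return [1.0] * n  # all sequences are identical
    return [t * n / total for t in totals]

weights = summed_distance_weights(["AAAA", "AAAA", "AAAC", "GGGG"])
print(weights)
```

Here the two identical sequences receive the smallest weights and the most distant sequence receives the largest one.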

Weight data input

Felsenstein weighting. In this field, enter the phylogenetic tree in (*.ph) format. This format is supported by many tree-inferring packages such as CLUSTALW, Phylip, Treecon, etc. If you use a tree, make sure that the sequence identifiers in the sequence data correspond to those in the tree data. In case of a mismatch, the CRASP program exits with an error. However, the sequence ID in the sequence data may be longer than in the tree data, for example, AP1_CHICK_156 (sequence input) and AP1_CHICK (tree input). See an example of input data for this weighting scheme.
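The ID correspondence can be checked before the data are submitted. The sketch below assumes that a sequence ID matches a tree label when the sequence ID starts with that label, which is consistent with the AP1_CHICK_156 / AP1_CHICK example; the exact matching rule used by CRASP is not stated here, and `match_ids` is a hypothetical helper.

```python
def match_ids(sequence_ids, tree_labels):
    """Map each sequence ID to the tree label it starts with.

    Raises ValueError when a sequence has no matching tree label or
    when a tree label is left without a sequence.
    """
    mapping = {}
    for seq_id in sequence_ids:
        candidates = [label for label in tree_labels if seq_id.startswith(label)]
        if not candidates:
            raise ValueError(f"no tree label matches sequence ID {seq_id!r}")
        # Prefer the longest label when several labels are prefixes of the ID.
        mapping[seq_id] = max(candidates, key=len)
    unused = set(tree_labels) - set(mapping.values())
    if unused:
        raise ValueError(f"tree labels without a sequence: {sorted(unused)}")
    return mapping

pairs = match_ids(["AP1_CHICK_156", "AP1_MOUSE_160"], ["AP1_CHICK", "AP1_MOUSE"])
print(pairs)
```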

User defined weighting. In this field, enter the weight of each sequence on a separate line (default format). The weights can also be entered in a free format; in that case, specify the separator symbols (e.g., ;,: ) in the Separator field.
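The two weight formats can be read with a few lines of Python. This is a sketch only: `parse_weights` is a hypothetical name, and treating whitespace as an additional separator in the free format is an assumption.

```python
import re

def parse_weights(text, separators=""):
    """Read sequence weights from text.

    Without separators, each non-empty line holds one weight (the
    default format). With separators, the text is split on any of the
    given symbols and on whitespace (assumption).
    """
    if separators:
        pattern = "[" + re.escape(separators) + r"\s]+"
    else:
        pattern = r"[\r\n]+"
    tokens = [t.strip() for t in re.split(pattern, text)]
    return [float(t) for t in tokens if t]

print(parse_weights("1.0\n0.5\n0.25"))
print(parse_weights("1.0; 0.5, 0.25:2", separators=";,:"))
```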

 

Integral protein characteristics description

An integral characteristic is defined as a linear combination of the values of the selected physico-chemical amino acid property at protein positions. Up to four characteristics can be analysed simultaneously. Define a name and a description for each characteristic.
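As a minimal sketch of this definition, the Python snippet below computes one integral characteristic for one aligned sequence. The property values in `TOY_SCALE` are made up for illustration and are not an AAindex scale, and the function name is hypothetical. How CRASP treats gap symbols is not specified here, so the example avoids them.

```python
# Made-up property values for four residues (illustration only).
TOY_SCALE = {"A": 1.0, "R": -2.0, "N": -1.5, "D": -1.0}

def integral_characteristic(sequence, coefficients, scale):
    """Return F = sum of x_i * property(residue at position i).

    `coefficients` maps 1-based alignment positions to the numbers x_i.
    """
    return sum(x * scale[sequence[pos - 1]] for pos, x in coefficients.items())

# Net value of the property over positions 1-3 (all coefficients equal to 1).
f_value = integral_characteristic("ARNDA", {1: 1.0, 2: 1.0, 3: 1.0}, TOY_SCALE)
print(f_value)  # 1.0 + (-2.0) + (-1.5) = -2.5
```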

Characteristic name

For convenience, you may give each integral characteristic a name: a character string of up to 50 symbols.

Characteristic description

To set up an integral physico-chemical characteristic, use the format:

x1(npos1); x2(npos2); ...; xn(nposn);

xi

Arbitrary numbers in a floating point format

nposi

Corresponding alignment positions, listed in an arbitrary form (using the ',' and '-' symbols). For example, (1-3,4,5,30-44) denotes positions from 1 to 5 and from 30 to 44.
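The description format can be turned into a position-to-coefficient table with a short parser. The sketch below is an assumption about how such a string might be read, not the CRASP parser itself: the function names are hypothetical, only the description part is parsed (not the characteristic name), and a position listed twice keeps the last coefficient given.

```python
import re

# One term: a floating-point number followed by positions in parentheses.
TERM = re.compile(r"\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)\s*\(([^()]*)\)\s*")

def parse_positions(spec):
    """Expand a string such as '1-3,4,5,30-44' into 1-based positions."""
    positions = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(part))
    return positions

def parse_characteristic(description):
    """Return {position: coefficient} for 'x1(npos1); x2(npos2); ...;'."""
    coefficients = {}
    for term in description.split(";"):
        if not term.strip():
            continue  # the trailing ';' leaves an empty piece
        match = TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"cannot parse term {term!r}")
        value = float(match.group(1))
        for pos in parse_positions(match.group(2)):
            coefficients[pos] = value  # last value wins (assumption)
    return coefficients

print(parse_positions("1-3,4,5,30-44"))
print(parse_characteristic("1.(6-8);"))
```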

Examples of characteristics:

Net value of a certain amino acid characteristic at alignment positions 6-8:

F1 Net value 1.(6-8);

 

Projection of the alpha-helical moment (for helix positions 1 to 5):

F1 Helix Momentum 1.(1); -0.17(2); -0.94(3); 0.5(4); 0.77(5);

 

where the coefficients are cosines of the rotation angle along the helix (100° per residue): cos(0°)=1; cos(100°)=-0.17; cos(200°)=-0.94; ...
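The helix coefficients above can be reproduced numerically. The sketch below assumes a rotation of 100 degrees per residue (3.6 residues per helix turn) and rounds to two decimal places as in the example; the function name is hypothetical.

```python
import math

def helix_coefficients(n_positions, angle_deg=100.0):
    """Cosine of the rotation angle at each helix position, rounded to 2 places."""
    return [
        round(math.cos(math.radians(angle_deg * i)), 2)
        for i in range(n_positions)
    ]

coeffs = helix_coefficients(5)
# Build a description string in the 'x(npos);' format described above.
description = " ".join(f"{c}({i});" for i, c in enumerate(coeffs, start=1))
print(description)
```

For longer helices, increasing `n_positions` extends the same series.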

Output parameters

Two output data modes are allowed:

Text (default)

ASCII text file (with HTML header)

Graphic

Plots in GIF format

The ASCII text format is convenient for further data analysis and for graphical representation in statistical packages (Excel, Statistica, etc.).

Output data include:

  1. Means and variances of the integral characteristics.
  2. If there is more than one characteristic, pairwise correlation coefficients for all pairs of characteristics.
  3. If there is more than one characteristic, linear regression parameters according to the linear functional relationship model.
  4. Values of the characteristics for all proteins in the sample.
  5. Comparison of the dispersion (variance) of each characteristic in the original data set with that expected under the random model. The values reported are: the Fi dispersion for the original sample (ORIGINAL); the calculated Fi dispersion expected for independent substitutions (RND_EXPECT); the mean (RND_MEAN) and standard deviation (RND_STDEV) of the Fi dispersion in the simulated sequence samples; and the estimated probability that the dispersion for a sample with independent substitutions is greater than for the original sample ( P{F(RND)>F(ORIG)} ). This comparison gives an estimate of the degree of constancy (variability) of a particular integral characteristic.
  6. For the correlation coefficients between (Fi, Fj) in randomized samples: their mean, their standard deviation, and the probability that the absolute value of the correlation coefficient in random samples is greater than in the original sample (Correlation coefficients: Mean; Correlation coefficients: Std.Dev.; P{|Cor(rand)|>|Cor(orig)|} ).
  7. Histograms (in text format) of the distribution of Fi dispersion (variance) values in randomized samples (the first column is Drand(Fi), the second is the frequency).
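The dispersion comparison in item 5 can be illustrated with a small simulation. The sketch below makes several assumptions that are not stated in this help page: randomized samples are produced by permuting the residues within each alignment column independently, the dispersion is the population variance, sequences are unweighted, and the toy property values are made up. The function names are hypothetical.

```python
import random
import statistics

def characteristic_values(seqs, coefficients, scale):
    """F value of every sequence; positions in `coefficients` are 1-based."""
    return [
        sum(x * scale[s[pos - 1]] for pos, x in coefficients.items())
        for s in seqs
    ]

def shuffle_columns(seqs, rng):
    """Permute the residues of each alignment column independently."""
    columns = [list(col) for col in zip(*seqs)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]

def dispersion_test(seqs, coefficients, scale, n_samples=200, seed=0):
    rng = random.Random(seed)
    original = statistics.pvariance(characteristic_values(seqs, coefficients, scale))
    # For independent columns the variance of a sum is the sum of variances.
    expected = sum(
        x * x * statistics.pvariance([scale[s[pos - 1]] for s in seqs])
        for pos, x in coefficients.items()
    )
    simulated = [
        statistics.pvariance(
            characteristic_values(shuffle_columns(seqs, rng), coefficients, scale)
        )
        for _ in range(n_samples)
    ]
    return {
        "ORIGINAL": original,
        "RND_EXPECT": expected,
        "RND_MEAN": statistics.mean(simulated),
        "RND_STDEV": statistics.stdev(simulated),
        "P{F(RND)>F(ORIG)}": sum(d > original for d in simulated) / n_samples,
    }

TOY_SCALE = {"A": 1.0, "D": -1.0}  # made-up property values
seqs = ["AD", "DA", "AD", "DA"]    # positions 1 and 2 compensate each other
result = dispersion_test(seqs, {1: 1.0, 2: 1.0}, TOY_SCALE)
print(result)
```

In this toy alignment the two positions always compensate each other, so the original dispersion is zero while most randomized samples have a positive dispersion. A high probability value therefore indicates that the characteristic is more constant than expected for independent substitutions.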

Graphic output displays different types of plots characterizing the distribution of the integral characteristics in both the original and the simulated data samples. To select a plot, tick the appropriate check-box.

F distribution in original set

These plots show the distribution of Fi values in the original protein sequence sample (as a histogram).

Fi vs Fj pairwise scatterplots

These scatterplots show the joint distribution of (Fi, Fj) values in the original protein sequence sample. Each point represents the pair of (Fi, Fj) values for one protein in the sample.

D(F) distribution in randomized samples

These plots show the distribution of the Fi dispersion values in the simulated random samples (as a histogram).

Dexp(F)/D(F) distribution in randomized samples

These plots show the distribution of the ratio Dexp(F)/D(F) in the simulated random samples (as a histogram).

 
