Integral Characteristic Analysis Help Info

CRASP

Integral linear characteristics analysis

Help information and parameters description

This part of the CRASP package estimates the constancy of a given protein physico-chemical characteristic. The characteristic is defined as a linear combination of physico-chemical amino acid property values at protein positions.

Calculation parameters

Sequence data format

At present, the CRASP package accepts aligned sequences in FASTA format only. Sequences should be written in the standard 20-letter code, with the symbol '-' for gaps. Allowed symbols are: ARNDCQEGHILKMFPSTWYV-. Both upper case and lower case letters are accepted. Sequences should be aligned. If the sequences differ in length, the alignment length is set to the length of the shortest sequence and all longer sequences are truncated to that length. Examples of input sequence data for several protein families are presented here.
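The input rules above can be illustrated with a minimal sketch. It is not the CRASP parser; the function name `read_aligned_fasta` and its error handling are assumptions made for illustration only.

```python
ALLOWED = set("ARNDCQEGHILKMFPSTWYV-")

def read_aligned_fasta(text):
    """Parse FASTA text into (ids, sequences).

    Sequences are upper-cased, checked against the allowed symbols,
    and cut to the length of the shortest sequence.
    """
    ids, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence ID.
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
            seqs.append("")
        elif seqs:
            seqs[-1] += line.upper()
    for name, seq in zip(ids, seqs):
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(f"{name}: unsupported symbols {sorted(bad)}")
    if not seqs:
        return ids, seqs
    length = min(len(s) for s in seqs)
    return ids, [s[:length] for s in seqs]
```

Checking the input locally in this way before submission can help catch unsupported symbols or unequal sequence lengths early.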

Amino acid physico-chemical characteristics

These characteristics of amino acids reflect physical and chemical interactions between residues. At present, the CRASP package contains data on 36 physico-chemical scales. The user selects one of these characteristics at this step of the analysis.

AAindex number

This field contains the ordinal number of an amino acid property in the AAINDEX database. The value can range from 1 to 434 (see list of indices here).
If this value is zero, the amino acid physico-chemical characteristic selected from the menu is used (see above).
AAINDEX database reference: Tomii, K. and Kanehisa, M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27-36. Internet: http://www.genome.ad.jp/dbget/aaindex.html

Number of randomized samples

This parameter sets the number of simulated randomized samples used to estimate the conservation of a particular integral characteristic. For characteristics that include many positions and for very large sequence samples, the calculations are time-consuming, so we recommend using small values for this parameter.

Weighting sequence data

Over-representation of some homologous sequences in the sample is known to bias statistical estimates. To avoid such biases, several sequence weighting schemes have been proposed. These approaches reduce the weights of over-represented sequences and assume that the "true" distribution of sequences in sequence space is homogeneous. The CRASP package supports several data weighting schemes. The option is controlled by the Weighting type parameter.

Weighting type

Four types of weighting are allowed:

OFF (default)	All the sequence weights are equal to 1
Felsenstein	The method suggested by Felsenstein (1985); its calculation is based on evolutionary tree data. If these data are available, this weighting scheme is recommended
Vingron & Argos	The method suggested by Vingron and Argos (1985)
User defined	The weight coefficients are supplied by the user

Weight data input

Felsenstein weighting. In this field, input the phylogenetic tree in (*.ph) format. This format is supported by many tree-inferring packages such as CLUSTALW, Phylip, Treecon, etc. If you use a tree, make sure that the sequence identifiers in the sequence data correspond to those in the tree data. In case of a mismatch, the CRASP program exits with an error. However, the CRASP package allows the sequence ID in the sequence data to be longer than in the tree data, for example, AP1_CHICK_156 (sequence input) and AP1_CHICK (tree input). See an example of input data for this weighting scheme.
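The correspondence between sequence IDs and tree IDs can be checked before submission. The sketch below assumes that a tree ID matches a sequence ID when it is a prefix of it, as in the AP1_CHICK_156 / AP1_CHICK example; the exact rule CRASP applies is not stated here, and the function name `match_tree_ids` is hypothetical.

```python
def match_tree_ids(sequence_ids, tree_ids):
    """Map each sequence ID to the tree ID that is its longest prefix.

    Assumption: a tree ID corresponds to a sequence ID if the sequence ID
    starts with it. Raises ValueError if a sequence ID has no match.
    """
    mapping = {}
    for seq_id in sequence_ids:
        candidates = [t for t in tree_ids if seq_id.startswith(t)]
        if not candidates:
            raise ValueError(f"no tree label matches sequence ID {seq_id!r}")
        # Prefer the most specific (longest) matching label.
        mapping[seq_id] = max(candidates, key=len)
    return mapping
```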

User defined weighting. In this field, enter the weight for each sequence on a separate line (default format). The weights can also be entered in a free format; in that case, specify the separator symbols (e.g., ;,: ) in the Separator field.
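As an illustration of the two input styles, the following sketch splits a weight list on line breaks, other whitespace and any user-supplied separator symbols. The function name `parse_weights` is hypothetical, and treating whitespace as a separator in free format is an assumption.

```python
import re

def parse_weights(text, separators=""):
    """Return the weights in `text` as a list of floats.

    Tokens are split on whitespace (including newlines) and on every
    character listed in `separators`, e.g. ";,:".
    """
    pattern = "[" + re.escape(separators) + r"\s]+"
    tokens = [t for t in re.split(pattern, text) if t]
    return [float(t) for t in tokens]
```

For example, `parse_weights("1.0\n0.5\n2")` handles the default one-per-line format, and `parse_weights("1.0; 0.5:2,0.25", ";,:")` handles a free-format list. The number of weights should equal the number of sequences in the alignment.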

Integral protein characteristics description

This characteristic is defined as a linear combination of selected physico-chemical amino acid property values at protein positions. Up to four characteristics can be analyzed simultaneously. Define a name and a description for each characteristic.

Characteristic name

For convenience, you may assign each integral characteristic a specific name, given as a character string of up to 50 characters.

Characteristic description

To set up integral physico-chemical characteristics, use the format:

x₁(npos₁); x₂(npos₂); ...; x_n(npos_n);

x_i	Arbitrary numbers in a floating point format
npos_i	Corresponding positions of alignment enumerated in an arbitrary form (using ',' and '-' symbols), for example: (1-3,4,5,30-44) denotes positions from 1 to 5 and from 30 to 44.
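The format above can be read as a list of (coefficient, positions) terms. The sketch below is one possible way to parse it, not the CRASP implementation; the function names are hypothetical.

```python
def parse_positions(spec):
    """Expand a position list such as '1-3,4,5,30-44' into integers."""
    positions = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(v) for v in part.split("-"))
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(part))
    return positions

def parse_characteristic(description):
    """Parse 'x1(npos1); x2(npos2); ...;' into [(coefficient, [positions]), ...]."""
    terms = []
    for term in description.split(";"):
        term = term.strip()
        if not term:
            continue  # skip the empty piece after the trailing ';'
        coefficient, _, rest = term.partition("(")
        if not rest.endswith(")"):
            raise ValueError(f"malformed term: {term!r}")
        terms.append((float(coefficient), parse_positions(rest[:-1])))
    return terms
```

With this sketch, `parse_characteristic("1.(6-8);")` yields one term with coefficient 1.0 applied to positions 6, 7 and 8.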

Examples of characteristics:

Net value of a given amino acid characteristic at alignment positions 6-8:

Net value

1.(6-8);

Projection of the alpha-helical moment (for helix positions 1 to 5):

Helix Momentum

1.(1); -0.17(2); -0.94(3); 0.5(4); 0.77(5);

where cos(0°) = 1; cos(100°) = -0.17; cos(200°) = -0.94; ...
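The coefficients in this example follow from a rotation of 100° per helix position. A short sketch that reproduces them is given below; the function name `helix_coefficients` and the rounding to two decimals are illustrative choices.

```python
import math

def helix_coefficients(n_positions, angle_deg=100.0, digits=2):
    """Cosine coefficients for helix positions 1..n_positions.

    Position i receives cos((i - 1) * angle_deg), rounded to `digits` decimals.
    """
    return [round(math.cos(math.radians(angle_deg * i)), digits)
            for i in range(n_positions)]

def to_description(coefficients):
    """Format coefficients as 'x1(1); x2(2); ...;' for the description field."""
    return " ".join(f"{c:g}({i});" for i, c in enumerate(coefficients, start=1))
```

For five positions this gives the coefficients 1, -0.17, -0.94, 0.5 and 0.77 used in the Helix Momentum example.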

Output parameters

Two output data modes are allowed:

Text (default)	ASCII text file (with HTML header)
Graphic	Plots in GIF-format

The ASCII text format is convenient for further data analysis and graphical representation in statistical packages (Excel, Statistica, etc.).

Output data include:

Means and variances of integral characteristics
If the number of characteristics exceeds 1, pairwise correlation coefficients are reported for all possible pairs of characteristics
If the number of characteristics exceeds 1, linear regression parameters are reported according to the linear functional relationship model
Values of characteristics for all proteins in a sample
Comparison of the dispersion values of characteristics between the original data set and those expected under the random model. The reported values are: the F_i dispersion value for the original sample (ORIGINAL); the calculated F_i dispersion value expected for independent substitutions (RND_EXPECT); the mean (RND_MEAN) and standard deviation (RND_STDEV) of the F_i dispersion values in simulated sequence samples; and the estimated probability that the dispersion value for a sample with independent substitutions is greater than for the original sample ( P{F(RND)>F(ORIG)} ). This comparison makes it possible to estimate the degree of constancy (variability) of a particular integral characteristic.
For correlation coefficients between (F_i, F_j) in randomized samples: their mean and standard deviation, and the probability that the correlation coefficient in random samples is greater in absolute value than in the original sample (Correlation coefficients: Mean; Correlation coefficients: Std.Dev.; P{|Cor(rand)|>|Cor(orig)|} )
Histograms (in text format) of the distribution of F_i dispersion (variance) values in randomized samples (the first column corresponds to D_rand(Fi), the second to the frequency value)
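The dispersion comparison can be sketched as follows. This is an illustration under stated assumptions, not the CRASP algorithm: the random model is assumed to shuffle residues within each alignment column independently, the dispersion is taken as the population variance, gaps and unknown symbols contribute zero, sequence weights are ignored, and all function and argument names are hypothetical.

```python
import random
import statistics

def characteristic_values(seqs, coefficients, scale):
    """F for each sequence: sum of coefficient * property value over positions.

    `coefficients` maps 1-based alignment positions to coefficients;
    `scale` maps residue letters to property values (gaps count as 0).
    """
    return [sum(c * scale.get(seq[pos - 1], 0.0) for pos, c in coefficients.items())
            for seq in seqs]

def randomization_test(seqs, coefficients, scale, n_samples=100, seed=0):
    """Compare the variance of F in the original sample with shuffled samples."""
    rng = random.Random(seed)
    original = statistics.pvariance(characteristic_values(seqs, coefficients, scale))
    columns = [list(col) for col in zip(*seqs)]
    # Variance expected when positions vary independently of each other.
    expected = sum(
        c ** 2 * statistics.pvariance([scale.get(ch, 0.0) for ch in columns[pos - 1]])
        for pos, c in coefficients.items())
    simulated = []
    for _ in range(n_samples):
        shuffled_cols = []
        for col in columns:
            col = col[:]          # shuffle a copy, keep the original column intact
            rng.shuffle(col)
            shuffled_cols.append(col)
        shuffled = ["".join(row) for row in zip(*shuffled_cols)]
        simulated.append(
            statistics.pvariance(characteristic_values(shuffled, coefficients, scale)))
    return {
        "ORIGINAL": original,
        "RND_EXPECT": expected,
        "RND_MEAN": statistics.mean(simulated),
        "RND_STDEV": statistics.pstdev(simulated),
        "P{F(RND)>F(ORIG)}": sum(v > original for v in simulated) / n_samples,
    }
```

In this sketch, a probability close to 1 means that the original sample varies less than nearly all shuffled samples, which would point to a conserved characteristic. A larger number of randomized samples gives a finer probability estimate at the cost of longer run time.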

Graphic output displays different types of plots, characterizing the distribution of integral characteristics both in original and simulated data samples. To select a plot, click the appropriate check-box.

F distribution in original set	Histograms of the distribution of F_i values in the original protein sequence sample
F_i vs F_j pairwise scatterplots	Scatterplots of the joint distribution of (F_i, F_j) values in the original protein sequence sample. Each point represents the pair of (F_i, F_j) characteristic values for one protein in the sample
D(F) distribution in randomized samples	Histograms of the distribution of the F_i dispersion values in simulated random samples
D_exp(F)/D(F) distribution in randomized samples	Histograms of the distribution of the ratio D_exp(F)/D(F) in simulated random samples
