SAMPLES format

Format Description (version 1.1)

1. The sample must be presented as a text file in ASCII format. Each line may contain no more than 80 characters. No empty lines are allowed. Language: English.
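The file-level rules in item 1 can be checked mechanically. The following is a minimal sketch, assuming the 80-character limit applies to each line; the function name check_sample_text is hypothetical and not part of the Samples system.

```python
def check_sample_text(text, max_len=80):
    """Return a list of (line_number, problem) tuples for the given file text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Rule: no empty lines are allowed.
        if not line.strip():
            problems.append((number, "empty line"))
        # Rule (assumed to be per line): at most 80 characters.
        if len(line) > max_len:
            problems.append((number, "longer than %d characters" % max_len))
        # Rule: the file must be plain ASCII.
        if not line.isascii():
            problems.append((number, "non-ASCII character"))
    return problems
```

An empty result means that none of the three rules is violated; the check says nothing about the field syntax itself.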

2. The file must consist of two parts: the descriptor and the sequence sample itself.

3. The descriptor contains the following fields:

FI <sample_ident> - brief name of the sample;

NM <string> - name of the sample and explanation of its biological meaning (NaMe); the format is free (multiple lines are allowed);

OR <string> - title of the organization (ORganization);

AU First_name Last_name - the first and last name of the author (AUthor);

DA DD-MON-YYYY - date of creation (DAte);

LU DD-MON-YYYY; First_name Last_name - the date and the author of the last update/modification (Last Update); if there has been no modification, the date and the author's name are taken from the DA and AU fields;

FV N - number of the format version (Format Version);

ST {X,Y} [Left_border,Right_border] Point; Description; Factor_name. - (STructure) description of the site (not a particular site, but the type of site); several ST fields are allowed;

X = 0...9 - the number of the site batch (sites are joined into a batch if, taken together, they represent an integral structure-function unit);

Y = 0...9 - the number of the site within the site batch;

[Left_border,Right_border] - the position of the site, if this particular site type has a fixed location relative to a specific point, for example, relative to the transcription start. If not, put nothing;

Point - the point relative to which the site location is fixed. If the site has no fixed position, put nothing;

Point = Transcription_start or Translation_start, etc. (the list is to be expanded);

Description - conventional name of the site (abbreviated name if available);

Factor_name - conventional name of the factor that binds to this site (abbreviated if available). If the factor is unknown, put nothing;

The FT field in the body of the sample describes the sites defined in the appropriate ST field of the descriptor, their positions in the particular sequence, and the methods used for recognition. (For more information, see the FT field description.)

Example (E. coli promoter description):

ST {0,1} [-44,-35] Transcription_start; -35-box; RNA-pol sigma subunit.

ST {0,3} [-14,-10] Transcription_start; TATA-box; RNA-pol sigma subunit.

Another example: glucocorticoid hormone binding site (GRE):

ST {0,0} ; GRE; GRE binding protein.
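The ST syntax above can be read with a single regular expression. The following is a minimal sketch, assuming that the bracketed borders and the Point are simply omitted when absent (as in the GRE example) and that the line ends with a period; parse_st and the dictionary keys are hypothetical names.

```python
import re

ST_RE = re.compile(
    r"^ST\s+\{(\d),(\d)\}\s*"              # {X,Y}: batch number, site number
    r"(?:\[(-?\d+),(-?\d+)\])?\s*"          # optional [Left_border,Right_border]
    r"([^;]*);\s*([^;]*);\s*(.*?)\.?\s*$"   # Point; Description; Factor_name.
)


def parse_st(line):
    """Parse one ST descriptor line into a dictionary."""
    m = ST_RE.match(line)
    if m is None:
        raise ValueError("not a valid ST line: %r" % line)
    x, y, left, right, point, description, factor = m.groups()
    return {
        "batch": int(x),
        "site": int(y),
        # None when the site type has no fixed location.
        "borders": (int(left), int(right)) if left is not None else None,
        "point": point.strip() or None,
        "description": description.strip(),
        # None when the factor is unknown ("put nothing").
        "factor": factor.strip() or None,
    }
```

Applied to the two promoter lines above, the sketch yields borders (-44, -35) and (-14, -10) relative to Transcription_start; for the GRE line both the borders and the point come back as None.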

AD - (ADditional) format description of optional fields added by the author (if there are none, the AD field is absent; multiple lines are allowed);

***** - end of the descriptor (the next line starts the sample);
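Separating the descriptor from the sample body can be sketched as follows. This is an illustration only: it assumes that every descriptor line begins with its two-character key followed by a blank, and that a multi-line field (NM, ST, AD) repeats the key on each line, which the description above does not state explicitly. The function names are hypothetical.

```python
def split_descriptor(lines):
    """Split file lines into (descriptor_lines, sample_lines) at the ***** line."""
    for index, line in enumerate(lines):
        if line.strip().startswith("*****"):
            return lines[:index], lines[index + 1:]
    raise ValueError("no ***** line found: descriptor is not terminated")


def read_descriptor(descriptor_lines):
    """Collect descriptor fields as {key: [value, ...]} in file order."""
    fields = {}
    for line in descriptor_lines:
        # Assumption: the key is everything before the first blank.
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value.strip())
    return fields
```

Values are kept as lists so that repeated fields such as ST are not overwritten.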

4. Each card of the sample includes the following fields (the mandatory fields are listed after the table):

key	description	number per entry	Coinciding with the EMBL format or not
ID	identification ; begins each entry	1	no
AC	accession number	1	no
DT	date	>=0	yes
DE	description	>=0	yes
KW	keyword	>=0	yes
OS	organism species	>=0	yes
OC	organism classification	>=0	yes
OG	organelle	>=0	yes
RN	reference number	>=0	yes
RC	reference comment	>=0	yes
RP	reference positions	>=0	yes
RX	reference cross-reference	>=0	yes
RA	reference author(s)	>=0	yes
RT	reference title	>=0	yes
RL	reference location	>=0	yes
DR	database cross-reference	>=1	no
FT	site description	>=1	no
CC	comments or notes	>=0	yes
SQ	sequence header	1	no
bb - (blanks)	sequence data	>=1	yes
//	termination line	1	yes

ID, AC, DR, FT, SQ, bb (blanks - sequence data), and // (termination line) are mandatory.
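The list of mandatory fields lends itself to a simple completeness check per entry. The following is a minimal sketch, assuming each entry is closed by a // line and each field line starts with its two-character key; split_entries and missing_fields are hypothetical helpers, and the sketch does not verify the "number per entry" column.

```python
MANDATORY = ("ID", "AC", "DR", "FT", "SQ")


def split_entries(sample_lines):
    """Group the lines after the descriptor into entries terminated by //."""
    entries, current = [], []
    for line in sample_lines:
        if line.startswith("//"):
            entries.append(current)
            current = []
        else:
            current.append(line)
    if current:
        raise ValueError("last entry is not terminated by //")
    return entries


def missing_fields(entry_lines):
    """Return the mandatory keys that do not occur in the entry."""
    present = {line[:2] for line in entry_lines}
    return [key for key in MANDATORY if key not in present]
```

An entry for which missing_fields returns an empty list contains at least one line for every mandatory key.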

5. The interpreter of the Samples system recognizes only the mandatory fields and ignores the rest.

6. When creating a sample, the author may add custom fields. The name of an added field must be a two-character identifier distinct from those listed above (descriptor and EMBL database identifiers). The format of the field is designed by the author. The field identifier and its format are described in the AD field.
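A proposed identifier for an added field can be checked against the reserved names. The following is a minimal sketch: the reserved set contains only the identifiers listed in this description (the full EMBL format defines further line types that are not listed here), and the requirement of two uppercase letters is an assumption. The name is_valid_custom_key is hypothetical.

```python
# Identifiers used in the descriptor (item 3).
DESCRIPTOR_KEYS = {"FI", "NM", "OR", "AU", "DA", "LU", "FV", "ST", "AD"}

# Identifiers used in the sample entries (item 4).
ENTRY_KEYS = {
    "ID", "AC", "DT", "DE", "KW", "OS", "OC", "OG", "RN", "RC",
    "RP", "RX", "RA", "RT", "RL", "DR", "FT", "CC", "SQ",
}

RESERVED = DESCRIPTOR_KEYS | ENTRY_KEYS


def is_valid_custom_key(key):
    """True if key is a two-letter uppercase identifier that is not reserved."""
    return (
        len(key) == 2
        and key.isascii()
        and key.isalpha()
        and key.isupper()  # assumption: identifiers are uppercase letters
        and key not in RESERVED
    )
```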

7. Format description of the fields (formats of all fields except for ID, SQ, bb (blanks), DR, and FT coincide with those of the EMBL database):

ID entryname; molecule

molecule = DNA or RNA

AC Y00321

DT DD-MON-YYYY (Rel. #, created)

DT DD-MON-YYYY (Rel. #, Last updated, Version #)

The dates of creation and modification; if the DT field is present, it always appears twice.

DE description

Gives the description of the sequence; the format is free.

KW keyword[; keyword...].

One (or several) keyword(s)

OS Genus species (name)

The Latin name of the species with its conventional English name, for example:

OS Homo sapiens (human)

OC Node[; Node...].

Taxonomic classification, for example:

OC Eukaryota; Planta; Phycophyta; Euglenophyceae

Several OC fields in succession are allowed.

DR database_identifier; primary_identifier; secondary_identifier; release_number; positions

The name of the source database; primary identifier; secondary identifier, release number; and positions

Database	Full name	Primary_identifier	Secondary_identifier
EMBL	EMBL database	AC-number	entry name (ID)
MEDLINE	MEDLINE literature database	MEDLINE ID	-
SWISS-PROT	SWISS-PROT Prot Seq Database	AC-number	entry name
SPTREMBL	SWISS-PROT TREMBL Database	AC-number
GDB	Human Genome Database
SGD	Saccharomyces Genome Database
MGD	Mouse Genome Database
TRANSFAC	Transcription Factor Database	AC-number	entry name
EPD	Eukaryotic Promoter Database	entry code	promoter name
FLYBASE	Drosophila Genetic Database	unique id	Gene symbol
CPGISLE	CpG Islands Database	entry code	release number
IMGT/LIGM	Immunogenetics Database	AC-number	release number
AGIS	Agricultural Genome Information Server	AC-number	release number

positions: join(xxx..yyy,...,zzz..www)

Example:

DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)

If the positions are negative [for instance, join(-200..-100)], the SQ field contains a sequence complementary to the corresponding entry of the source database. In that instance, the sequence is complementary to the fragment (100..200).
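A DR line can be split on semicolons and its join(...) expression expanded into coordinate pairs. The following is a minimal sketch, assuming all five parts are always present; parse_dr and the dictionary keys are hypothetical names, and a mix of positive and negative segments, which the description does not cover, is not treated as complementary.

```python
import re


def parse_dr(line):
    """Parse a DR line into its five parts and a list of (start, end) segments."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    parts = [part.strip() for part in line[2:].split(";")]
    if len(parts) != 5:
        raise ValueError("expected 5 semicolon-separated parts: %r" % line)
    database, primary, secondary, release, positions = parts
    m = re.fullmatch(r"join\((.*)\)", positions)
    if m is None:
        raise ValueError("positions must look like join(xxx..yyy,...)")
    segments = []
    for chunk in m.group(1).split(","):
        start, end = chunk.split("..")
        segments.append((int(start), int(end)))
    return {
        "database": database,
        "primary": primary,
        "secondary": secondary,
        "release": release,
        "segments": segments,
        # Negative positions mark a sequence complementary to the source entry.
        "complementary": all(s < 0 and e < 0 for s, e in segments),
    }
```

For the example line above, the segments are (100, 200) and (300, 400) and the complementary flag is False; for join(-200..-100) the flag is True.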

FT {X,Y} [left;right]; Method

X = 0...9 - number of the site batch
Y = 0...9 - number of the site in the site batch
[left;right] - positions from the start of the sequence
Method = EXP (experimental),
GBS (Gibbs Sampler alignment),
RCG (Recognition Group alignment), etc.; the list is to be specified and expanded.
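An FT line follows the pattern given above and can be parsed in the same way as ST. The following is a minimal sketch; the sample line in the usage note is invented for illustration, parse_ft is a hypothetical name, and because the list of methods is open-ended, an unknown method is reported rather than rejected.

```python
import re

# Methods named in this description; the list is explicitly open-ended.
KNOWN_METHODS = {"EXP", "GBS", "RCG"}

FT_RE = re.compile(
    r"^FT\s+\{(\d),(\d)\}\s*"     # {X,Y}: batch number, site number
    r"\[(\d+);(\d+)\];\s*"        # [left;right]: positions from sequence start
    r"(\S+)\s*$"                  # Method
)


def parse_ft(line):
    """Parse one FT line into a dictionary."""
    m = FT_RE.match(line)
    if m is None:
        raise ValueError("not a valid FT line: %r" % line)
    x, y, left, right, method = m.groups()
    return {
        "batch": int(x),
        "site": int(y),
        "left": int(left),
        "right": int(right),
        "method": method,
        "known_method": method in KNOWN_METHODS,
    }
```

For instance, parse_ft("FT {0,1} [12;21]; EXP") describes site 1 of batch 0 at positions 12 to 21, found experimentally; the pair (batch, site) is what links it to the ST line with the same {X,Y}.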

SQ   gcccagactggctagctagcagctcgatcgagctagctagcgatcgatgcatgctatgct
     cgtagcgtactgctagtacgcgtagctcgtacgct

(the line contains 60 characters starting from the 6th position)
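The layout rule for sequence lines can be sketched as a pair of helpers that write and read the SQ block. This is an illustration under two assumptions: the first chunk of the sequence sits on the SQ line itself, as in the example, and continuation lines are padded with five blanks so that the data start at the 6th position. The names format_sq and read_sq are hypothetical.

```python
def format_sq(sequence, width=60):
    """Wrap a sequence into SQ lines of at most `width` sequence characters."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    lines = []
    for index, chunk in enumerate(chunks):
        # "SQ" plus three blanks, or five blanks, so data start at column 6.
        prefix = "SQ   " if index == 0 else "     "
        lines.append(prefix + chunk)
    return lines


def read_sq(lines):
    """Join the sequence characters of an SQ block back into one string."""
    parts = []
    for line in lines:
        # Drop the key if present, then any padding blanks.
        body = line[2:] if line.startswith("SQ") else line
        parts.append(body.strip())
    return "".join(parts)
```

read_sq strips blanks rather than cutting at a fixed column, so it also accepts lines in which the padding has been altered.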

// (ends each entry)