Format Description (version 1.1)

1. The sample must be presented as a text file in ASCII format; each line may be no more than 80 characters long. No empty lines are allowed. Language: English.
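
The constraints in item 1 can be checked mechanically. The following is a minimal sketch, assuming the 80-character limit applies to each line; the function name `check_sample_text` is hypothetical and is not part of the Samples system.

```python
def check_sample_text(text: str, max_len: int = 80) -> list[str]:
    """Return a list of problems found; an empty list means the text passes."""
    problems = []
    # The file must be plain ASCII.
    if not text.isascii():
        problems.append("file contains non-ASCII characters")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Empty (or whitespace-only) lines are not allowed.
            problems.append(f"line {number}: empty line")
        elif len(line) > max_len:
            # Assumption: the length limit is per line.
            problems.append(f"line {number}: {len(line)} characters (limit {max_len})")
    return problems


if __name__ == "__main__":
    print(check_sample_text("FI demo\n\nNM demo sample"))
```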

2. The file must consist of two parts: the descriptor and the sequence sample itself.

3. The descriptor contains the following fields:

FI <sample_ident> brief name of the sample;

NM <string> name of the sample and explanation of its biological meaning (NaMe); the format is free (multiple lines are allowed);

OR <string> title of the organization (ORganization);

AU First_name Last_name the first and last name of the author (AUthor);

DA DD-MON-YYYY date of creation (DAte);

LU DD-MON-YYYY; First_name Last_name the date and the author of the last update/modification (Last Update); if there has been no modification, the date and the author's name are taken from the DA and AU fields;

FV N Number of the Format Version;

ST {X,Y} [Left_border,Right_border] Point; Description; Factor_name. (STructure) description of the site (not a particular site, but type of the sites); several ST fields are allowed;

X=0....9 the number of the site batch (sites are joined into batches if in the aggregate they represent an integral structure-function unit);

Y=0....9 the number of the site in a site batch;

[Left_border,Right_border] The position of the site, in case this particular site type has a fixed location relative to a specific point, for example, relative to the transcription start. If not, put nothing;

Point the point relative to which the site location is fixed; If the site has no fixed position, put nothing;

Point = Transcription_start or Translation_start, etc. (the list is to be expanded);

Description conventional name of the site (abbreviated name if available);

Factor_name conventional name of the factor binding with this site (abbreviated if available). If the factor is unknown, put nothing;

In the FT field in the body of the sample, one has to describe the sites that are defined in the appropriate ST field of the descriptor, their positions in the particular sequence, and the methods used for recognition. (For more information, see the FT field description.)

Example (E. coli promoter description):

ST {0,1} [-44,-35] Transcription_start; -35-box; RNA-pol sigma subunit.

ST {0,3} [-14,-10] Transcription_start; TATA-box; RNA-pol sigma subunit.

Another example: glucocorticoid hormone binding site (GRE):

ST {0,0} ; GRE; GRE binding protein.

AD (ADditional) format description of optional fields added by the author (if they are lacking, the AD field is absent; multiple lines are allowed);

***** end of the descriptor (the next line starts the sample);
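
To illustrate how the ST field breaks down into its parts, here is a hedged sketch of a parser for a single ST line. The regular expression is inferred from the two examples above rather than taken from the Samples system, and the names `SiteType` and `parse_st` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the ST examples in this description (an assumption).
ST_PATTERN = re.compile(
    r"^ST\s+\{(\d),(\d)\}\s*"        # {X,Y}: batch number and site number
    r"(?:\[(-?\d+),(-?\d+)\])?\s*"   # optional [Left_border,Right_border]
    r"([^;]*);\s*([^;]*);\s*"        # Point; Description;
    r"(.*?)\.?\s*$"                  # Factor_name, optional trailing period
)


@dataclass
class SiteType:
    batch: int
    site: int
    left: Optional[int]
    right: Optional[int]
    point: Optional[str]
    description: str
    factor: Optional[str]


def parse_st(line: str) -> SiteType:
    """Parse one ST line; parts left empty in the line become None."""
    match = ST_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid ST line: {line!r}")
    x, y, left, right, point, description, factor = match.groups()
    return SiteType(
        batch=int(x),
        site=int(y),
        left=int(left) if left is not None else None,
        right=int(right) if right is not None else None,
        point=point.strip() or None,
        description=description.strip(),
        factor=factor.strip() or None,
    )


if __name__ == "__main__":
    print(parse_st("ST {0,1} [-44,-35] Transcription_start; -35-box; RNA-pol sigma subunit."))
    print(parse_st("ST {0,0} ; GRE; GRE binding protein."))
```

Making the border group optional lets the same pattern accept the GRE example, where both the position and the reference point are omitted.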

4. Each card of the sample includes the following fields (mandatory fields are marked with *):

key   description                          number per entry   coincides with the EMBL format
ID*   identification; begins each entry    1                  no
AC*   accession number                     1                  no
DT    date                                 >=0                yes
DE    description                          >=0                yes
KW    keyword                              >=0                yes
OS    organism species                     >=0                yes
OC    organism classification              >=0                yes
OG    organelle                            >=0                yes
RN    reference number                     >=0                yes
RC    reference comment                    >=0                yes
RP    reference positions                  >=0                yes
RX    reference cross-reference            >=0                yes
RA    reference author(s)                  >=0                yes
RT    reference title                      >=0                yes
RL    reference location                   >=0                yes
DR*   database cross-reference             >=1                no
FT*   site description                     >=1                no
CC    comments or notes                    >=0                yes
SQ*   sequence header                      1                  no
bb*   (blanks) sequence data               >=1                yes
//*   termination line                     1                  yes

ID, AC, DR, FT, SQ, bb (blanks; sequence data), and // (termination line) are mandatory.
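
A check for the mandatory fields could look like the sketch below. It assumes that the two-character key occupies the first two columns of each line and that sequence data lines (bb) begin with blanks; the function name and the FT values in the demo entry are made up for illustration.

```python
# Mandatory keys as listed above; "  " (two blanks) stands for sequence data lines.
MANDATORY = {
    "ID": "ID",
    "AC": "AC",
    "DR": "DR",
    "FT": "FT",
    "SQ": "SQ",
    "  ": "bb (sequence data)",
    "//": "//",
}


def missing_mandatory_fields(entry_lines: list[str]) -> list[str]:
    """Return the names of mandatory fields that do not occur in the entry."""
    # Assumption: the line key is the first two characters of each line.
    present = {line[:2] for line in entry_lines}
    return [name for key, name in MANDATORY.items() if key not in present]


if __name__ == "__main__":
    entry = [
        "ID demo; DNA",
        "AC Y00321",
        "DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)",
        "FT {0,1} [10;19]; EXP",  # hypothetical site position
        "SQ",
        "     gcccagactg",
        "//",
    ]
    print(missing_mandatory_fields(entry))
```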

5. The interpreter of the Samples system recognizes only the mandatory fields and ignores the rest.

6. When creating a sample, the author may add custom fields. The name of an added field must be a two-character identifier distinct from those listed above (descriptor and EMBL database identifiers). The format of the field is designed by the author. The field identifier and its format are described in the AD field.

7. Format description of the fields (formats of all fields except for ID, SQ, bb (blanks), DR, and FT coincide with those of the EMBL database):

ID entryname; molecule

molecule = DNA or RNA

AC Y00321

DT DD-MON-YYYY (Rel. #, Created)

DT DD-MON-YYYY (Rel. #, Last updated, Version #)

The dates of creation and modification; if the DT field is present, it always appears twice.
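
Dates in the DD-MON-YYYY form can be converted without relying on locale-dependent month names. This is a minimal sketch; the helper name `parse_dd_mon_yyyy` is hypothetical, and accepting mixed-case month abbreviations is an assumption.

```python
import datetime

# English three-letter month abbreviations, independent of the system locale.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_dd_mon_yyyy(text: str) -> datetime.date:
    """Convert a DD-MON-YYYY string into a datetime.date."""
    day, month, year = text.strip().split("-")
    # Upper-casing the month accepts both "JAN" and "Jan" (an assumption).
    return datetime.date(int(year), MONTHS[month.upper()], int(day))


if __name__ == "__main__":
    print(parse_dd_mon_yyyy("28-JAN-1993"))
```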

DE description

Gives the description of the sequence; the format is free.

KW keyword[; keyword...].

One (or several) keyword(s)

OS Genus species (name)

The Latin name of the species with its conventional English name, for example:

OS Homo sapiens (human)

OC Node[; Node...].

Taxonomic classification, for example:

OC Eukaryota; Planta; Phycophyta; Euglenophyceae.

Several OC fields in succession are allowed

DR database_identifier; primary_identifier; secondary_identifier; release_number; positions

The name of the source database, the primary identifier, the secondary identifier, the release number, and the positions.

Database     Full name                                Primary_identifier   Secondary_identifier
EMBL         EMBL database                            AC-number            entry name (ID)
MEDLINE      MEDLINE literature database              MEDLINE ID           -
SWISS-PROT   SWISS-PROT Prot Seq Database             AC-number            entry name
SPTREMBL     SWISS-PROT TREMBL Database               AC-number
GDB          Human Genome Database
SGD          Saccharomyces Genome Database
MGD          Mouse Genome Database
TRANSFAC     Transcription Factor Database            AC-number            entry name
EPD          Eukaryotic Promoter Database             entry code           promoter name
FLYBASE      Drosophila Genetic Database              unique id            Gene symbol
CPGISLE      CpG Islands Database                     entry code           release number
IMGT/LIGM    Immunogenetics Database                  AC-number            release number
AGIS         Agricultural Genome Information Server   AC-number            release number

positions: join(xxx..yyy,...,zzz..www)

Example:

DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)

If the positions are negative [for instance, join(-200..-100)], the SQ field contains a sequence complementary to the corresponding entry of the source database. In the bracketed example, the sequence is complementary to the fragment (100..200).
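
The DR line and its join(...) positions can be taken apart as in the sketch below. It assumes exactly five semicolon-separated parts, as in the example above; the function name `parse_dr` and the returned dictionary layout are hypothetical.

```python
import re


def parse_dr(line: str) -> dict:
    """Split a DR line into its parts and decode the join(...) positions."""
    fields = [field.strip() for field in line[2:].split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 semicolon-separated parts: {line!r}")
    database, primary, secondary, release, positions = fields
    # Each range is written as start..end; the numbers may carry a minus sign.
    ranges = [(int(a), int(b)) for a, b in re.findall(r"(-?\d+)\.\.(-?\d+)", positions)]
    return {
        "database": database,
        "primary": primary,
        "secondary": secondary,
        "release": release,
        "ranges": ranges,
        # Negative positions mark a sequence complementary to the source entry.
        "complementary": any(a < 0 or b < 0 for a, b in ranges),
    }


if __name__ == "__main__":
    print(parse_dr("DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)"))
```

For a complementary fragment, taking the absolute values of the decoded range gives the coordinates in the source database entry.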

FT {X,Y} [left;right]; Method

X = 0...9 number of the site batch
Y = 0...9 number of the site in the site batch
[left;right] positions from the start of the sequence
Method = EXP (experimental),
GBS (Gibbs Sampler alignment),
RCG (Recognition Group alignment), etc.; the list is to be specified and expanded.
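
An FT line can be read in the same way as an ST line. The sketch below assumes the layout given above, with a semicolon between the left and right positions; the names `SiteOccurrence` and `parse_ft` and the positions in the demo line are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern assumed from the layout "FT {X,Y} [left;right]; Method".
FT_PATTERN = re.compile(r"^FT\s+\{(\d),(\d)\}\s*\[(-?\d+);(-?\d+)\];\s*(\S+)\s*$")


class SiteOccurrence(NamedTuple):
    batch: int    # X: number of the site batch
    site: int     # Y: number of the site in the batch
    left: int     # left position from the start of the sequence
    right: int    # right position from the start of the sequence
    method: str   # e.g. EXP, GBS, RCG


def parse_ft(line: str) -> SiteOccurrence:
    """Parse one FT line into its batch, site, positions, and method."""
    match = FT_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid FT line: {line!r}")
    x, y, left, right, method = match.groups()
    return SiteOccurrence(int(x), int(y), int(left), int(right), method)


if __name__ == "__main__":
    print(parse_ft("FT {0,1} [12;21]; EXP"))
```

The (batch, site) pair is what ties an FT line back to the ST field of the descriptor with the same {X,Y}.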

SQ      gcccagactggctagctagcagctcgatcgagctagctagcgatcgatgcatgctatgct
            cgtagcgtactgctagtacgcgtagctcgtacgct

(each sequence line contains up to 60 characters, starting from the 6th position)

// (ends each entry)
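
Since every entry ends with a // line, a file body can be cut into entries and the sequence collected from the SQ line onward. This is a sketch under two assumptions: sequence text may appear on the SQ line itself (as in the example above) as well as on the following lines, and whitespace inside the sequence is not significant. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def split_entries(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield lists of lines, one list per entry, each ending with its // line."""
    entry: list[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []


def read_sequence(entry: list[str]) -> str:
    """Concatenate the sequence found between the SQ key and the // line."""
    parts: list[str] = []
    in_sequence = False
    for line in entry:
        if line.startswith("SQ"):
            in_sequence = True
            parts.append(line[2:])  # any sequence text on the SQ line itself
        elif line.startswith("//"):
            break
        elif in_sequence:
            parts.append(line)
    # Drop all blanks so that only the sequence letters remain.
    return "".join("".join(parts).split())


if __name__ == "__main__":
    body = [
        "ID demo; DNA",
        "SQ      gcccagactg",
        "     cgtagcgtac",
        "//",
    ]
    for entry in split_entries(body):
        print(read_sequence(entry))
```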