Format Description (version 1.1)
1. The sample must be presented as a text file in ASCII format, and
no line may be longer than 80 characters. No empty lines are allowed. Language: English.
2. The file must consist of two parts: the descriptor and the sequence sample itself.
3. The descriptor contains the following fields:
FI <sample_ident> → brief name of the sample;
NM <string> → name of the sample and explanation of its biological meaning (NaMe); the format is free (multiple lines are allowed);
OR <string> → name of the organization (ORganization);
AU First_name Last_name → the first and last name of the author (AUthor);
DA DD-MON-YYYY → date of creation (DAte);
LU DD-MON-YYYY; First_name Last_name → the date and the author of the last update or modification (Last Update); if there has been no modification, the date and the author's name are taken from the DA and AU fields;
FV N → number of the format version (Format Version);
ST {X,Y} [Left_border,Right_border] Point; Description; Factor_name. → (STructure) description of the site (a type of site, not a particular site); several ST fields are allowed;
X = 0...9 → the number of the site batch (sites are joined into a batch if together they represent an integral structure-function unit);
Y = 0...9 → the number of the site within the site batch;
[Left_border,Right_border] → the position of the site, if this site type has a fixed location relative to a specific point, for example, relative to the transcription start. Otherwise, put nothing;
Point → the point relative to which the site location is fixed. If the site has no fixed position, put nothing;
Point = Transcription_start or Translation_start, etc. (the list is to be expanded);
Description → conventional name of the site (abbreviated name if available);
Factor_name → conventional name of the factor that binds to this site (abbreviated if available). If the factor is unknown, put nothing;
The FT field in the body of the sample must describe the sites defined in the appropriate ST field of the descriptor, their positions in the particular sequence, and the methods used for recognition. (For more information, see the FT field description.)
Example (E. coli promoter description):
ST {0,1} [-44,-35] Transcription_start; -35-box; RNA-pol sigma subunit.
ST {0,3} [-14,-10] Transcription_start; TATA-box; RNA-pol sigma subunit.
Another example: glucocorticoid hormone binding site (GRE):
ST {0,0} ; GRE; GRE binding protein.
AD → (ADditional) format description of optional fields added by the author (if there are none, the AD field is absent; multiple lines are allowed);
***** → end of the descriptor (the next line starts the sample);
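The ST layout above can be illustrated with a short parser. This is only a sketch under stated assumptions: the names SiteType, ST_RE and parse_st are hypothetical, and the regular expression is one reading of the template "ST {X,Y} [Left_border,Right_border] Point; Description; Factor_name." in which the bracketed position, the point and the factor name may each be left empty.

```python
import re
from typing import NamedTuple, Optional


class SiteType(NamedTuple):
    """One ST field of the descriptor (hypothetical container)."""
    batch: int                  # X: number of the site batch
    site: int                   # Y: number of the site in the batch
    left: Optional[int]         # Left_border, None if the position is not fixed
    right: Optional[int]        # Right_border, None if the position is not fixed
    point: Optional[str]        # reference point, None if not given
    description: str            # conventional name of the site
    factor: Optional[str]       # binding factor, None if unknown


# ASSUMPTION: fields are separated exactly as in the examples of this document.
ST_RE = re.compile(
    r"^ST\s+\{(\d),(\d)\}\s*"        # {X,Y}
    r"(?:\[(-?\d+),(-?\d+)\])?\s*"   # optional [Left_border,Right_border]
    r"([^;]*);"                      # Point (may be empty)
    r"([^;]*);"                      # Description
    r"(.*?)\.?\s*$"                  # Factor_name (may be empty), closing period
)


def parse_st(line: str) -> SiteType:
    """Parse a single ST line; raise ValueError if it does not fit the template."""
    match = ST_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid ST line: {line!r}")
    x, y, left, right, point, description, factor = match.groups()
    return SiteType(
        batch=int(x),
        site=int(y),
        left=int(left) if left is not None else None,
        right=int(right) if right is not None else None,
        point=point.strip() or None,
        description=description.strip(),
        factor=factor.strip() or None,
    )


if __name__ == "__main__":
    print(parse_st("ST {0,1} [-44,-35] Transcription_start; -35-box; RNA-pol sigma subunit."))
    print(parse_st("ST {0,0} ; GRE; GRE binding protein."))
```

Empty optional parts are mapped to None so that a site type without a fixed position, such as the GRE example, can be told apart from one anchored to a reference point.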
4. Each entry (card) of the sample includes the following fields (the mandatory fields are listed after the table):
key         | description                       | number per entry | coincides with the EMBL format
ID          | identification; begins each entry | 1                | no
AC          | accession number                  | 1                | no
DT          | date                              | >=0              | yes
DE          | description                       | >=0              | yes
KW          | keyword                           | >=0              | yes
OS          | organism species                  | >=0              | yes
OC          | organism classification           | >=0              | yes
OG          | organelle                         | >=0              | yes
RN          | reference number                  | >=0              | yes
RC          | reference comment                 | >=0              | yes
RP          | reference positions               | >=0              | yes
RX          | reference cross-reference         | >=0              | yes
RA          | reference author(s)               | >=0              | yes
RT          | reference title                   | >=0              | yes
RL          | reference location                | >=0              | yes
DR          | database cross-reference          | >=1              | no
FT          | site description                  | >=1              | no
CC          | comments or notes                 | >=0              | yes
SQ          | sequence header                   | 1                | no
bb (blanks) | sequence data                     | >=1              | yes
//          | termination line                  | 1                | yes
ID, AC, DR, FT, SQ, bb (blanks; sequence data), and // (termination line) are mandatory.
5. The interpreter of the Samples system recognizes only the mandatory fields and ignores the rest.
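The behaviour described in item 5 can be sketched as a reader that keeps the mandatory fields and silently skips everything else. This is not the Samples interpreter itself; read_entries, missing_fields and the SAMPLE text are hypothetical, and the sketch assumes that each key occupies the first two characters of a line and that sequence data lines begin with blanks.

```python
MANDATORY_KEYS = ("ID", "AC", "DR", "FT", "SQ")

# Hypothetical sample for illustration only: the names "demo" and "DEMO1",
# the DE text and the FT positions are made up.
SAMPLE = """\
FI demo
FV 1.1
*****
ID DEMO1; DNA
AC Y00321
DE free-text description, skipped by this reader
DR EMBL; M24308; HSADH2E1; 39; join(100..200)
FT {0,1} [3;8]; EXP
SQ
     gcccagactggctagctagc
//
"""


def new_entry():
    """Return an empty entry with one list per mandatory key."""
    entry = {key: [] for key in MANDATORY_KEYS}
    entry["sequence"] = ""
    return entry


def read_entries(text):
    """Split the sample body into entries, keeping only mandatory fields."""
    lines = text.splitlines()
    # The descriptor ends at the first line of asterisks.
    start = next(i for i, line in enumerate(lines) if line.startswith("*****")) + 1
    entries, entry = [], new_entry()
    for line in lines[start:]:
        if line.startswith("//"):            # termination line closes the entry
            entries.append(entry)
            entry = new_entry()
        elif line.startswith("  "):          # blanks: sequence data
            entry["sequence"] += "".join(line.split())
        elif line[:2] in MANDATORY_KEYS:
            entry[line[:2]].append(line[2:].strip())
        # every other key (DT, DE, KW, ...) is skipped
    return entries


def missing_fields(entry):
    """List the mandatory parts that an entry lacks."""
    missing = [key for key in MANDATORY_KEYS if not entry[key]]
    if not entry["sequence"]:
        missing.append("sequence data")
    return missing


if __name__ == "__main__":
    for entry in read_entries(SAMPLE):
        print(entry["ID"], missing_fields(entry))
```

Fields that may occur several times per entry, such as DR and FT, are collected in lists; the optional DE line in the sample is dropped.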
6. When creating a sample, the author may add custom fields. The name of an added field must be a two-character identifier distinct from those listed above (the descriptor and EMBL database identifiers). The format of the field is designed by the author. The field identifier and its format are described in the AD field.
7. Format description of the fields (formats of all fields except for ID, SQ, bb (blanks), DR, and FT coincide with those of the EMBL database):
ID entryname; molecule
molecule = DNA or RNA
AC Y00321
DT DD-MON-YYYY (Rel. #, Created)
DT DD-MON-YYYY (Rel. #, Last updated, Version #)
The dates of creation and of the last modification; if the DT field is present, it always appears twice.
DE description
Gives the description of the sequence; the format is free.
KW keyword[; keyword...].
One (or several) keyword(s)
OS Genus species (name)
The Latin name of the species with its conventional English name, for example:
OS Homo sapiens (human)
OC Node[; Node...].
Taxonomic classification, for example:
OC Eukaryota; Planta; Phycophyta; Euglenophyceae
Several OC fields in succession are allowed
DR database_identifier; primary_identifier; secondary_identifier; release_number; positions
The name of the source database, the primary identifier, the secondary identifier, the release number, and the positions.
Database   | Full name                              | Primary_identifier | Secondary_identifier
EMBL       | EMBL database                          | AC-number          | entry name (ID)
MEDLINE    | MEDLINE literature database            | MEDLINE ID         | -
SWISS-PROT | SWISS-PROT Prot Seq Database           | AC-number          | entry name
SPTREMBL   | SWISS-PROT TREMBL Database             | AC-number          |
GDB        | Human Genome Database                  |                    |
SGD        | Saccharomyces Genome Database          |                    |
MGD        | Mouse Genome Database                  |                    |
TRANSFAC   | Transcription Factor Database          | AC-number          | entry name
EPD        | Eukaryotic Promoter Database           | entry code         | promoter name
FLYBASE    | Drosophila Genetic Database            | unique id          | Gene symbol
CPGISLE    | CpG Islands Database                   | entry code         | release number
IMGT/LIGM  | Immunogenetics Database                | AC-number          | release number
AGIS       | Agricultural Genome Information Server | AC-number          | release number
positions: join(xxx..yyy,...,zzz..www)
Example:
DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)
If the positions are negative, for instance join(-200..-100), the SQ field holds a sequence complementary to the corresponding entry of the source database. In this instance, the sequence is complementary to the fragment (100..200).
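The DR layout and the sign convention for positions can be sketched as follows. The function name parse_dr and the returned dictionary keys are hypothetical, and mapping a negative range back to the source fragment by taking absolute values is one reading of the rule above, not a confirmed algorithm.

```python
import re

# Matches one "xxx..yyy" range inside join(...); either bound may be negative.
RANGE_RE = re.compile(r"(-?\d+)\.\.(-?\d+)")


def parse_dr(line):
    """Parse 'DR database; primary; secondary; release; join(...)'."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    fields = [field.strip() for field in line[2:].split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}: {line!r}")
    database, primary, secondary, release, positions = fields
    ranges = [(int(a), int(b)) for a, b in RANGE_RE.findall(positions)]
    # Negative positions mark a sequence complementary to the source entry.
    complementary = any(a < 0 or b < 0 for a, b in ranges)
    # ASSUMPTION: the source fragment is recovered from absolute values,
    # so (-200, -100) corresponds to the fragment (100, 200).
    fragments = [tuple(sorted((abs(a), abs(b)))) for a, b in ranges]
    return {
        "database": database,
        "primary": primary,
        "secondary": secondary,
        "release": release,
        "ranges": ranges,
        "complementary": complementary,
        "fragments": fragments,
    }


if __name__ == "__main__":
    print(parse_dr("DR EMBL; M24308; HSADH2E1; 39; join(100..200,300..400)"))
    print(parse_dr("DR EMBL; M24308; HSADH2E1; 39; join(-200..-100)"))
```

The release number is kept as a string because the document does not say whether it is always numeric.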
FT {X,Y} [left;right]; Method
X = 0...9 → number of the site batch
Y = 0...9 → number of the site in the site batch
[left;right] → positions from the start of the sequence
Method = EXP (experimental),
GBS (Gibbs Sampler alignment),
RCG (Recognition Group alignment),
etc.; the list is to be specified and expanded.
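An FT line can be read and applied to the sequence along the following lines. The names Site, parse_ft and site_sequence are hypothetical, the example line is made up, and the sketch assumes that positions are 1-based and inclusive, which the description above does not state explicitly.

```python
import re
from typing import NamedTuple


class Site(NamedTuple):
    """One FT field of an entry (hypothetical container)."""
    batch: int     # X: number of the site batch
    site: int      # Y: number of the site in the batch
    left: int      # left position from the start of the sequence
    right: int     # right position from the start of the sequence
    method: str    # recognition method, e.g. EXP, GBS or RCG


FT_RE = re.compile(
    r"^FT\s+\{(\d),(\d)\}\s*"     # {X,Y}
    r"\[(-?\d+);(-?\d+)\];\s*"    # [left;right], separated by a semicolon
    r"(\w+)\.?\s*$"               # Method
)


def parse_ft(line: str) -> Site:
    """Parse a single FT line; raise ValueError if it does not fit the template."""
    match = FT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid FT line: {line!r}")
    x, y, left, right, method = match.groups()
    return Site(int(x), int(y), int(left), int(right), method)


def site_sequence(sequence: str, site: Site) -> str:
    """Cut the site out of the entry sequence.

    ASSUMPTION: positions are 1-based and both borders are included.
    """
    return sequence[site.left - 1:site.right]


if __name__ == "__main__":
    site = parse_ft("FT {0,1} [3;6]; EXP")
    print(site, site_sequence("gcccagactg", site))
```

The pair {X,Y} links each FT line back to the ST field of the descriptor that defines the corresponding site type.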
SQ
     gcccagactggctagctagcagctcgatcgagctagctagcgatcgatgcatgctatgct
     cgtagcgtactgctagtacgcgtagctcgtacgct
(each sequence line contains up to 60 characters, starting from the 6th position)
// (ends each entry)
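As a closing illustration, the sequence layout and the file-level rules of item 1 (ASCII only, lines of at most 80 characters, no empty lines) can be checked mechanically. The functions format_sequence and check_lines are hypothetical helpers, not part of the Samples system, and the sketch assumes that sequence lines are indented with five blanks.

```python
def format_sequence(sequence, width=60, indent=5):
    """Return sequence lines of `width` characters that start at column indent + 1."""
    return [
        " " * indent + sequence[start:start + width]
        for start in range(0, len(sequence), width)
    ]


def check_lines(lines, max_length=80):
    """Return (line_number, problem) pairs for violations of the file-level rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        if len(line) > max_length:
            problems.append((number, "longer than 80 characters"))
        if not line.isascii():
            problems.append((number, "non-ASCII character"))
    return problems


if __name__ == "__main__":
    body = ["SQ"] + format_sequence("gcccagactg" * 13) + ["//"]
    print("\n".join(body))
    print(check_lines(body))
```

With five leading blanks and 60 sequence characters, a full sequence line is 65 characters long and therefore stays within the 80-character limit.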