[Biojava-dev] Forthcoming change in the EMBL database

mark.schreiber at novartis.com mark.schreiber at novartis.com
Tue May 23 02:18:58 UTC 2006


Hi Richard -

Can you take charge of future-proofing the BioJavaX EMBL format object to 
cope with this?

Thanks.

- Mark





Carola Kanz <ckanz at ebi.ac.uk>
Sent by: biojava-dev-bounces at lists.open-bio.org
04/26/2006 11:00 PM

 
        To:     biojava-dev at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-dev] Forthcoming change in the EMBL database



Dear colleagues,

We would like to announce the following important change in the EMBL 
database in June this year.

At the time of release 87 (available from JUN-2006) the format of the 
EMBL flat file will undergo a change: the ID line will have a different 
structure (see below) and the SV line will be removed.

The changes affecting the ID line structure are:

     * All tokens will be separated by a semicolon.
     * The entry name will not be displayed; in its place there will be 
       the primary accession number.
     * The sequence version will be indicated.
     * The topology will be a separate token and will be indicated for 
       both circular and linear molecules.
     * Both the data class and the taxonomic divisions will be displayed.

This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
        (1)     (2)     (3)      (4)       (5)  (6)   (7)


The tokens represent:

    1. Primary accession number.
    2. 'SV' + sequence version number.
    3. Topology: 'circular' or 'linear'.
    4. Molecule type.
    5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, 
       STS, STD; "normal" entries will have STD for standard).
    6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, 
       INV, SYN, UNC, VRL, PHG).
    7. Sequence length + 'BP.'.
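
For anyone adapting a parser, the seven tokens above could be picked apart 
roughly as follows. This is only a minimal sketch based on the example ID 
line in this announcement: the class name EmblIdLine, its fields and the 
regular expression are illustrative assumptions, not part of BioJava or of 
any EMBL specification, and the pattern assumes three-letter upper-case 
codes for data class and division as in the lists above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the seven tokens of a new-style EMBL ID line. */
final class EmblIdLine {
    // "ID", then seven semicolon-separated tokens; the last one ends in "BP.".
    // Assumption: data class and division are three upper-case letters.
    private static final Pattern NEW_STYLE = Pattern.compile(
            "ID\\s+([^;\\s]+);\\s*SV\\s+(\\d+);\\s*(linear|circular);"
            + "\\s*([^;]+?);\\s*([A-Z]{3});\\s*([A-Z]{3});\\s*(\\d+)\\s+BP\\.\\s*");

    final String accession;     // token 1: primary accession number
    final int sequenceVersion;  // token 2: number following "SV"
    final boolean circular;     // token 3: topology
    final String moleculeType;  // token 4: e.g. "genomic DNA"
    final String dataClass;     // token 5: e.g. "HTG" or "STD"
    final String division;      // token 6: e.g. "MAM"
    final int length;           // token 7: sequence length in base pairs

    private EmblIdLine(Matcher m) {
        accession = m.group(1);
        sequenceVersion = Integer.parseInt(m.group(2));
        circular = m.group(3).equals("circular");
        moleculeType = m.group(4).trim();
        dataClass = m.group(5);
        division = m.group(6);
        length = Integer.parseInt(m.group(7));
    }

    /** Parses a new-style ID line; anything else is rejected. */
    static EmblIdLine parse(String line) {
        Matcher m = NEW_STYLE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Not a new-style EMBL ID line: " + line);
        }
        return new EmblIdLine(m);
    }
}
```

Calling EmblIdLine.parse on the example line above gives accession 
"CD789012" and sequenceVersion 4. Lines that do not match the new layout 
throw an IllegalArgumentException, so a caller could catch that and fall 
back to an existing old-style parser.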

The entry name will not be displayed any more in the ID line. Since EMBL 
release 3 (Dec 1983) the stable identifier of an entry has been the 
primary accession number.

A mapping file (entry name to accession number) will be provided with the
next release for those entries where the entry name doesn't coincide with 
the accession number.

To give users a test dataset, one file with new-style ID lines called 
new_id_line.test.gz was provided together with the March release of the 
EMBL database: 
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/new_id_line.test.gz 
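
To try a parser against the test file, one option is to pull out just the 
ID lines first. The sketch below assumes the file has already been 
downloaded to a local path; the class name EmblIdLineScanner and the method 
idLines are made up for illustration, and the only format assumption is 
that ID lines begin with the two-letter line code "ID" followed by a space.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper that collects the ID lines of a gzipped flat file. */
final class EmblIdLineScanner {

    /** Returns every line starting with the "ID" line code, in file order. */
    static List<String> idLines(Path gzFile) throws IOException {
        List<String> result = new ArrayList<>();
        try (InputStream in = Files.newInputStream(gzFile);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new GZIPInputStream(in), StandardCharsets.US_ASCII))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Other line codes (AC, DE, FT, ...) and sequence data are skipped.
                if (line.startsWith("ID ")) {
                    result.add(line);
                }
            }
        }
        return result;
    }
}
```

For example, idLines(Paths.get("new_id_line.test.gz")) would return the ID 
lines of a local copy, which can then be fed one by one to whatever ID line 
parser is under test.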

Feedback from users is sought; please use the "Contact us" link at the 
bottom of the EBI home page and specify "EMBL" in the feedback form.

Note: this information was first made available on our
"Forthcoming changes" page ( 
http://www.ebi.ac.uk/embl/Documentation/forthcomingchanges.html#0606 ) 
and in the EMBL database release notes.

Regards,
Carola Kanz
EMBL database





_______________________________________________
biojava-dev mailing list
biojava-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-dev





