[Bioperl-l] New EMBL format parsing/writing

Chris Fields cjfields at uiuc.edu
Mon Jul 24 16:21:39 UTC 2006


The only proposed EMBL changes I can remember were for Tax data (organism
lines).  It shouldn't be hard to change the way these are parsed.

We could leave parsing of SV for older files and run a check on the ID line
format to accommodate old and new sequences, though I have no problem with
only supporting the latest formats.  Continual support for old deprecated
sequence formats leads to lots of cruft over time; SwissProt parsing has the
same issue.  You would be surprised how many people out there never bother
to update their sequences and use old data...  
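
For what it's worth, the ID line check could be as small as the sketch below. It is Python rather than Perl, purely for illustration, and it is not what SeqIO::embl does. The name id_line_format is made up, and the old-style layout (entry name, whitespace, data class, then semicolon-separated fields) and the old-style sample line in the usage note are my assumptions about pre-change files.

```python
import re

# New-style ID line: accession, then ";", then an "SV <number>" token.
NEW_ID = re.compile(r"^ID\s+\S+;\s*SV\s+\d+;")

# Old-style ID line (assumed layout): entry name, whitespace, data class,
# with no semicolon directly after the entry name.
OLD_ID = re.compile(r"^ID\s+\S+\s+\w+;")


def id_line_format(line: str) -> str:
    """Classify an EMBL ID line as 'new', 'old', or 'unknown'."""
    if NEW_ID.match(line):
        return "new"
    if OLD_ID.match(line):
        return "old"
    return "unknown"
```

A parser could call this once per entry and dispatch to the matching ID handler, e.g. id_line_format("ID   X56734     standard; RNA; PLN; 1859 BP.") is classified as old-style under the assumption above.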

I believe you are referring to this (from the latest EMBL release notes):

...

2 CHANGES IN THIS RELEASE

2.1 Changes to the Feature Table Document: Chapter 3.5 "Location"

The use of range (.) descriptor within location spans is no longer legal.
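
A rough way to flag locations that still use the range descriptor is sketched here in Python. It assumes the descriptor appears as a single dot between two positions (for example "(23.45)..600"), that ".." stays legal as the span operator, and that a dot may also appear inside an accession.version prefix of a remote location. The function name is hypothetical.

```python
import re

# A single dot between two integers, e.g. "23.45". The lookarounds skip the
# ".." span operator and the "." inside an accession.version prefix such as
# "J00194.1:100..202".
RANGE_DESCRIPTOR = re.compile(r"(?<![\w.])\d+\.\d+(?![\d.:])")


def uses_range_descriptor(location: str) -> bool:
    """Return True if the location string contains a single-dot range."""
    return RANGE_DESCRIPTOR.search(location) is not None
```

This only detects the old syntax; deciding how to rewrite such a location is left to the caller.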

2.2 ID line changes

ID line structure underwent the following changes

    * All tokens are separated by a semicolon.
    * The entry name is not displayed, in its place there is the primary
accession number.
    * The sequence version is indicated.
    * The topology is a separate token and is indicated for both circular
and linear molecules.
    * Both the data class and taxonomic divisions will be displayed.


This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
       (1)     (2)     (3)      (4)       (5)  (6)   (7)


The tokens represent:

   1. Primary accession number.
   2. 'SV' + sequence version number.
   3. Topology: 'circular' or 'linear'.
   4. Molecule type.
   5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS,
STD); "normal" entries will have STD for standard.
   6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV,
SYN, UNC, VRL, PHG).
   7. Sequence length + 'BP.'.
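
The seven tokens listed above map naturally onto a small record. The following is a minimal Python sketch of splitting a new-style ID line on semicolons. EmblId and parse_new_id_line are illustrative names, not part of BioPerl or any EMBL tool, and the validation is deliberately light.

```python
from typing import NamedTuple


class EmblId(NamedTuple):
    accession: str
    version: int
    topology: str
    molecule: str
    data_class: str
    division: str
    length: int


def parse_new_id_line(line: str) -> EmblId:
    """Split a new-style EMBL ID line into its seven tokens."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    tokens = [t.strip() for t in line[2:].split(";")]
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, got {len(tokens)}")
    acc, sv, topology, molecule, data_class, division, length = tokens

    # Token 2 is 'SV' followed by the version number.
    sv_tag, _, sv_num = sv.partition(" ")
    if sv_tag != "SV" or not sv_num.strip().isdigit():
        raise ValueError(f"bad sequence version token: {sv!r}")

    # Token 3 must be one of the two topologies.
    if topology not in ("linear", "circular"):
        raise ValueError(f"bad topology: {topology!r}")

    # Token 7 is the length followed by 'BP.'.
    size, _, unit = length.partition(" ")
    if unit.strip() != "BP." or not size.isdigit():
        raise ValueError(f"bad length token: {length!r}")

    return EmblId(acc, int(sv_num), topology, molecule,
                  data_class, division, int(size))
```

Run on the example ID line above, this yields accession CD789012, version 4, topology linear, molecule type "genomic DNA", data class HTG, division MAM and length 500. A line with a different token count raises ValueError, which is one way a caller could fall back to old-format handling.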


The entry name is no longer displayed in the ID line.
A mapping file (entryname to accession number)
ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping is
provided for those entries where the entryname is not the same as the
accession number.

The SV line has been dropped as sequence version information is now
displayed in the ID line.
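
A reader that needs the sequence version from either format could look in both places. The Python sketch below assumes the old SV line has the form "SV   <accession>.<version>" and that scanning can stop at the SQ line; sequence_version is a hypothetical helper name.

```python
import re
from typing import Iterable, Optional

# New format: the version is the second token of the ID line.
ID_SV = re.compile(r"^ID\s+\S+;\s*SV\s+(\d+);")

# Old format (assumed): a separate "SV   ACCESSION.VERSION" line.
SV_LINE = re.compile(r"^SV\s+\S+\.(\d+)\s*$")


def sequence_version(lines: Iterable[str]) -> Optional[int]:
    """Return the sequence version of one EMBL entry, or None if absent."""
    for line in lines:
        m = ID_SV.match(line) or SV_LINE.match(line)
        if m:
            return int(m.group(1))
        if line.startswith("SQ"):
            # Sequence data starts here; no version was found in the header.
            break
    return None
```

Returning None rather than raising leaves it to the caller to decide whether a missing version is an error.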

In order to facilitate the changeover to the new ID line structure, two
small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'. They
can be used to convert EMBL flat files from the old to the new format and
vice-versa. The converters can be found at

ftp://ftp.ebi.ac.uk/pub/databases/embl/tools

A new version of the Syncron tools (for maintaining synchronised copies of
EMBL database updates) that became the working version with EMBL release 87
can be found in the same directory. In this version the tools were adjusted
to cope with the new format of the ID line in EMBL entries and some related
changes.

...


Chris


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of simon andrews (BI)
> Sent: Monday, July 24, 2006 8:34 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] New EMBL format parsing/writing
> 
> A few weeks ago I saw a couple of messages on this list mentioning the
> new ID/SV line format used in the latest EMBL release.  I'm in the
> process of moving our database server over to the new format and was
> looking to update SeqIO::embl.pm.
> 
> I'm sure someone said they'd made a patch to fix up parsing of the new
> format, but I can't find it either in CVS or bugzilla.
> 
> Rather than do this again myself can someone point me to an updated
> SeqIO::embl.pm please?  If there isn't one then I'll look into making
> the patch myself.
> 
> Since this is such a major change are there any plans to put out a new
> release with this fix included?  I'm sure this will start to bite more
> people as the new format becomes more widely adopted.
> 
> 
> Cheers
> 
> Simon.
> 
> --
> Simon Andrews PhD
> Bioinformatics Group
> The Babraham Institute
> 
> simon.andrews at bbsrc.ac.uk
> +44 (0) 1223 496463
> 



