[Bioperl-l] EMBL release 87 format changes.

Wed Jul 19 19:47:58 UTC 2006

BioPerl Users and Developers,

I have updated the EMBL SeqIO parser to work correctly with Release 87 of 
EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier 
message, the EMBL parser now reads both new and old formats, but only 
writes the new format. 

I don't think that my changes will affect most users, but if you are using 
the EMBL format, please review the changes described below and speak up if 
anything looks like it could create a problem for you.

If I don't hear any objections soon, I will submit a patch to Bugzilla.

Thanks,

- David

Parser changes:

- EMBL files no longer contain the "entry name".  When reading old format
  files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id
  and Bio::Seq::display_id, but when reading new format files, the
  accession number is used for these fields.
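
To illustrate the rule above, here is a minimal sketch in Python (not the
actual BioPerl code) of telling the two ID line layouts apart and picking
the display id.  The regular expressions and the names parse_id_line,
NEW_ID and OLD_ID are made up for this example, and the field layouts are
my reading of the old and Release 87 ID lines, so treat them as
assumptions rather than a full implementation of the spec.

```python
import re

# Assumed new-style (Release 87) layout:
#   ID   accession; SV version; topology; molecule; data class; division; N BP.
NEW_ID = re.compile(
    r"^ID\s+(?P<accession>[^;\s]+);\s*SV\s+(?P<version>\d+);"
    r"\s*(?P<topology>[^;]+);\s*(?P<molecule>[^;]+);"
    r"\s*(?P<data_class>[^;]+);\s*(?P<division>[^;]+);"
    r"\s*(?P<length>\d+)\s+BP\.\s*$"
)

# Assumed old-style layout:
#   ID   entryname  data class; molecule; division; N BP.
OLD_ID = re.compile(
    r"^ID\s+(?P<entry_name>\S+)\s+(?P<data_class>[^;]+);"
    r"\s*(?P<molecule>[^;]+);\s*(?P<division>[^;]+);"
    r"\s*(?P<length>\d+)\s+BP\.\s*$"
)


def parse_id_line(line):
    """Return the ID line fields plus the id a reader would display."""
    match = NEW_ID.match(line)
    if match:
        fields = match.groupdict()
        fields["format"] = "new"
        # New format has no entry name, so the accession becomes the id.
        fields["display_id"] = fields["accession"]
        return fields
    match = OLD_ID.match(line)
    if match:
        fields = match.groupdict()
        fields["format"] = "old"
        # Old format: the entry name from the ID line is the id.
        fields["display_id"] = fields["entry_name"]
        return fields
    raise ValueError("unrecognized EMBL ID line: %r" % line)
```

The new-style pattern is tried first because an old-style line has no
semicolon directly after its first token and therefore cannot match it.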

Changes to output:

- The ID line was changed to the new format. 

- The SV line is never written; SV is now part of the ID line.

- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now
  written as "unassigned DNA" and "unassigned RNA".

- Strictly speaking, EMBL format should only be used for nucleotide
  sequences.  If the alphabet is 'protein', write_seq() emits a warning and
  writes the non-standard molecule type "AA" in the ID line.

- Because BioPerl sequences do not have a "data class" attribute, all
  sequences are written with a data class of "STD" in the ID line.

- The ID line contains the Bio::Seq::accession, unless it is missing, in
  which case the Bio::Seq::id is used.

- "Molecule type" is strictly validated.  Non-EMBL values are output as
  "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.

- "Taxonomic division" is strictly validated.  Non-EMBL values are output
  as "UNC".

- The taxonomic division code "UNK" is now written as "UNC"
  (unclassified).
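
For anyone who wants to check their own values against the output rules
listed above, here is a rough sketch in Python of how they fit together.
It is not the BioPerl writer itself.  The function names, the "topology"
argument, and the exact contents of the two value sets are assumptions
for illustration; check the sets against the EMBL release notes before
relying on them.

```python
import warnings

# Assumed (possibly incomplete) sets of values accepted by EMBL.
VALID_MOLECULES = {
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "snoRNA",
    "snRNA", "scRNA", "pre-RNA", "other RNA", "other DNA",
    "unassigned DNA", "unassigned RNA", "viral cRNA",
}
VALID_DIVISIONS = {
    "PHG", "ENV", "FUN", "HUM", "INV", "MAM", "VRT", "MUS",
    "PLN", "PRO", "ROD", "SYN", "TGN", "UNC", "VRL",
}


def normalize_molecule(molecule, alphabet):
    """Map a molecule type onto a value that may appear in the ID line."""
    if alphabet == "protein":
        # EMBL is a nucleotide format: warn and use the non-standard "AA".
        warnings.warn("EMBL format should only be used for nucleotide sequences")
        return "AA"
    if molecule in VALID_MOLECULES:
        return molecule
    # Plain "DNA"/"RNA" and any other non-EMBL value fall back by alphabet.
    return "unassigned RNA" if alphabet == "rna" else "unassigned DNA"


def normalize_division(division):
    """Non-EMBL division codes, including the old "UNK", become "UNC"."""
    return division if division in VALID_DIVISIONS else "UNC"


def format_id_line(accession, seq_id, version, topology,
                   molecule, alphabet, division, length):
    """Build a new-style ID line; the data class is always "STD"."""
    name = accession or seq_id  # fall back to the id if no accession
    return "ID   %s; SV %s; %s; %s; STD; %s; %d BP." % (
        name, version, topology,
        normalize_molecule(molecule, alphabet),
        normalize_division(division), length,
    )
```

For example, format_id_line("X56734", "TRBG361", 1, "linear", "RNA",
"rna", "UNK", 1859) turns the molecule type into "unassigned RNA" and the
division into "UNC".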

Possible Gotchas for some users:

- Because the EMBL entry name is no longer included anywhere in the file,
  when round-tripping from old format to new format the entry name will be
  lost.

- In order to ensure that BioPerl writes valid EMBL files, I have added
  strict validation to the writer for "molecule type" and "taxonomic
  division".  This could present a problem for users who are using
  non-standard values for these fields, but I felt it was important to
  write files that adhere to the EMBL spec.