[Bioperl-l] EMBL release 87 format changes.
Chris Fields
cjfields at uiuc.edu
Wed Jul 19 21:46:43 UTC 2006
You can go ahead and submit the patch to Bugzilla anyway. Comments about
the proposed changes from the developers can be added there.
I think there's some confusion here, though: the SeqIO change you mention,
the one I committed, was actually to Bio::SeqIO::swiss (SwissProt), not EMBL.
I haven't touched Bio::SeqIO::embl (yet). The 'swiss' format now reads both
old and new swiss data files and writes only the new format; no major changes
have been made to SeqIO::embl in about a year (and even that one was small).
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com
> Sent: Wednesday, July 19, 2006 2:48 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] EMBL release 87 format changes.
>
> BioPerl Users and Developers,
>
> I have updated the EMBL SeqIO parser to work correctly with Release 87 of
> EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier
> message, the EMBL parser now reads both new and old formats, but only
> writes the new format.
>
> I don't think that my changes will affect most users, but if you are using
> the EMBL format can you review the changes described below and speak up if
> anything looks like it could create a problem for you?
>
> If I don't hear any objections soon, I will submit a patch to bugzilla.
>
> Thanks,
>
> - David
>
> Parser changes:
>
> - EMBL files no longer contain the "entry name". When reading old format
> files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id
> and Bio::Seq::display_id, but when reading new format files, the accession
> number is used for these fields.
>
> Changes to output:
>
> - The ID line was changed to the new format.
>
> - The SV line is never written; SV is now part of the ID line.
>
> - "DNA" and "RNA" are no longer valid EMBL molecule types. They are now
> written as "unassigned DNA" and "unassigned RNA".
>
> - Strictly speaking, EMBL format should only be used for nucleotide
> sequences. If the alphabet is 'protein', write_seq() emits a warning and
> writes the non-standard molecule type "AA" in the ID line.
>
> - Because BioPerl sequences do not have a "data class" attribute, all
> sequences are written with a data class of "STD" in the ID line.
>
> - The ID line contains the Bio::Seq::accession, unless it is missing, in
> which case the Bio::Seq::id is used.
>
> - "molecule type" is strictly validated. Non-EMBL values are output as
> "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
>
> - "taxonomic division" is strictly validated. Non-EMBL values are output
> as "UNC".
>
> - The taxonomic division code "UNK" is now written as "UNC"
> (unclassified).
>
> Possible Gotchas for some users:
>
> - Because the EMBL entry name is no longer included anywhere in the file,
> when round-tripping from old format to new format the entry name will be
> lost.
>
> - In order to ensure that BioPerl writes valid EMBL files, I have added
> strict validation to the writer for "molecule type" and "taxonomic
> division". This could present a problem for users who are using
> non-standard values for these fields, but I felt it was important to write
> files that adhere to the EMBL spec.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
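For anyone checking their own data against the output rules David lists above, here is a rough, unofficial sketch of how a new-style ID line could be assembled. It is plain Python for illustration, not the Bio::SeqIO::embl patch itself. The function name embl_id_line, its parameters, the field order, and the two lookup sets (small illustrative subsets, not the full EMBL vocabularies) are assumptions made up for this example; check the EMBL user manual for the authoritative lists and layout.

```python
import warnings

# Hypothetical, partial vocabularies for illustration only.
MOLECULE_TYPES = {"genomic DNA", "genomic RNA", "mRNA",
                  "unassigned DNA", "unassigned RNA"}
DIVISIONS = {"HUM", "PLN", "PRO", "UNC"}


def embl_id_line(accession, seq_id, version, alphabet,
                 molecule, division, length, circular=False):
    """Build a release-87-style ID line following the rules in the message."""
    # Use the accession; fall back to the sequence id if it is missing.
    name = accession or seq_id

    if alphabet == "protein":
        # EMBL is meant for nucleotides: warn and write non-standard "AA".
        warnings.warn("EMBL format should only be used for nucleotide sequences")
        mol = "AA"
    elif molecule in MOLECULE_TYPES:
        mol = molecule
    else:
        # Non-EMBL values (including bare "DNA"/"RNA") become "unassigned ...".
        mol = "unassigned RNA" if alphabet == "rna" else "unassigned DNA"

    # Non-EMBL divisions (including the old "UNK") are written as "UNC".
    div = division if division in DIVISIONS else "UNC"

    topology = "circular" if circular else "linear"

    # Data class is always "STD"; SV is carried in the ID line itself.
    # The length unit is left as "BP" even for protein; the message does not
    # say what the patch writes in that case.
    return f"ID   {name}; SV {version}; {topology}; {mol}; STD; {div}; {length} BP."


print(embl_id_line("X56734", "TRBG361", 1, "rna", "mRNA", "PLN", 1859))
print(embl_id_line(None, "myseq", 1, "dna", "DNA", "UNK", 10))
```

The second call shows the fallbacks working together: with no accession the id is used, "DNA" becomes "unassigned DNA", and "UNK" becomes "UNC".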