[Bioperl-l] GenBank/EMBL species parsing problems
Chris Fields
cjfields at uiuc.edu
Sat Aug 26 03:50:39 UTC 2006
When I was working on bug 2077, I noticed that several EMBL files do not
convert properly to GenBank files. The problem comes down to differences in
the way the EMBL and GenBank parsers handle species information, especially
the way each stores names in Bio::Species common_name(). For instance,
SeqIO::embl stores the real common name, while SeqIO::genbank stores the
entire SOURCE line (bad!). Since the common name isn't always present for
EMBL sequences, SeqIO::genbank's write_seq() chokes on them.
I plan on updating SeqIO::genbank so that its species parsing is more
consistent, but I think the output should resemble the format of the
current GenBank release if possible. I have already added a few changes
that pick up the organelle and common name if they are present. The proper
write_seq() addition would just rebuild the SOURCE line from them. Does
anyone have a problem with that?
If not, I'll go ahead and commit these in the next day or so and modify/add
tests accordingly.
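To make the idea concrete, here is a rough sketch of the split-and-rebuild
logic. It is written in Python rather than Perl and is not the actual
SeqIO::genbank code. The function names parse_source() and build_source()
and the short ORGANELLES list are assumptions made up for illustration. It
also assumes a SOURCE value of the form
"[organelle] Organism name (common name)".

```python
import re

# Assumed, non-exhaustive list of organelle prefixes (illustration only).
ORGANELLES = ("mitochondrion", "chloroplast", "plastid", "kinetoplast")


def parse_source(text):
    """Split a SOURCE-style value into organelle, organism, and common name."""
    text = text.strip()
    organelle = None
    for prefix in ORGANELLES:
        if text.lower().startswith(prefix + " "):
            organelle = text[:len(prefix)]
            text = text[len(prefix):].strip()
            break
    common = None
    # A trailing "(...)" is treated as the common name, if present.
    match = re.match(r"^(.*?)\s*\(([^()]*)\)$", text)
    if match:
        text, common = match.group(1), match.group(2)
    return {"organelle": organelle, "organism": text, "common_name": common}


def build_source(parts):
    """Rebuild the SOURCE value, skipping any piece that is missing."""
    pieces = []
    if parts.get("organelle"):
        pieces.append(parts["organelle"])
    pieces.append(parts["organism"])
    line = " ".join(pieces)
    if parts.get("common_name"):
        line += " (%s)" % parts["common_name"]
    return line
```

The point of keeping the pieces separate is that a record without a common
name (as with many EMBL entries) can still be written out: build_source()
simply omits the parenthesized part.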
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign