[Bioperl-l] GenBank/EMBL species parsing problems

Chris Fields cjfields at uiuc.edu
Sat Aug 26 03:50:39 UTC 2006


When I was working on bug 2077, I noticed that several EMBL files do not
convert properly to GenBank files.  The problem comes down to differences
in the way the EMBL and GenBank parsers handle species information in the
two formats (especially the way they both store names in Bio::Species
common_name()).  For instance, SeqIO::embl uses the real common name, while
SeqIO::genbank uses the entire SOURCE line (bad!).  As the common name isn't
always present for EMBL sequences, this chokes SeqIO::genbank's write_seq().

I plan on updating GenBank parsing in SeqIO::genbank to be more consistent,
but I think the output should resemble the current GenBank release format
if possible.  I have already added a few changes that pick up the organelle
and common name, if they are present.  The proper write_seq() addition would
just rebuild the SOURCE line from those pieces.  Does anyone have a problem
with that?
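
To make the idea concrete, here is a rough, language-neutral sketch (in
Python rather than Perl, and not the actual SeqIO::genbank code).  It assumes
the SOURCE text looks like "[organelle] <organism> [(common name)]" and that
the scientific name is already known from the ORGANISM line.  The function
names split_source() and build_source() are made up for illustration.

```python
import re

def split_source(source, organism):
    """Split SOURCE text into organelle, organism and common name.

    Assumes (hypothetically) the layout
    "[organelle] <organism> [(common name)]", where <organism> is the
    scientific name taken from the ORGANISM line.
    """
    source = source.strip().rstrip(".")
    parts = {"organelle": None, "organism": organism, "common_name": None}
    idx = source.find(organism)
    if idx < 0:
        # Organism name not found in SOURCE: leave the optional parts empty.
        return parts
    prefix = source[:idx].strip()
    suffix = source[idx + len(organism):].strip()
    if prefix:
        parts["organelle"] = prefix
    match = re.fullmatch(r"\((.*)\)", suffix)
    if match:
        parts["common_name"] = match.group(1).strip()
    return parts

def build_source(parts):
    """Rebuild the SOURCE text, skipping whichever pieces are absent."""
    pieces = []
    if parts["organelle"]:
        pieces.append(parts["organelle"])
    pieces.append(parts["organism"])
    if parts["common_name"]:
        pieces.append("(%s)" % parts["common_name"])
    return " ".join(pieces)
```

The point of the rebuild step is that a record with no common name (as is
often the case coming from EMBL) still yields a valid SOURCE line with just
the organism name, instead of failing.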

If not, I'll go ahead and commit these in the next day or so and modify/add
tests accordingly.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 




