[Bioperl-l] Bug in genbank parsing: CONTIG gaps

Michael Rogoff miker at biotiquesystems.com
Wed May 3 03:31:59 UTC 2006


I've encountered a serious bug in Bio::SeqIO when parsing certain GenBank
files that contain CONTIG entries with gaps.  One such record is NW_925173.

When I try to parse this file using Bio::SeqIO::genbank, the parser enters an
infinite loop and spins until it runs out of memory.

I'm fairly certain it relates to this bug:
http://bugzilla.bioperl.org/show_bug.cgi?id=1319
which seems to indicate that GenBank records with CONTIG gaps are not valid and
can't be parsed.  Strangely, that bug claims to be fixed, yet the loop is still
right there in the code for FTLocationFactory.  I assume it may have been fixed
in other contexts but is still not fixed in Bio::SeqIO::genbank?  Or am I doing
something wrong?

I think this should be filed as an open bug.  Even if BioPerl isn't interested
in parsing this type of file via SeqIO, surely no finite input file should be
able to send the parser into an infinite loop.  Have others encountered this
problem?  Is there any plan to address it?
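
As an illustration of the kind of safeguard meant here, the following is a
minimal sketch in plain Perl.  It is not the actual FTLocationFactory code, and
split_location_args is a hypothetical helper name.  It splits the arguments of
a location operator at top-level commas and dies, rather than spinning, if an
iteration ever fails to consume input.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of BioPerl: split the argument list of a
# location operator such as join(...) at top-level commas only, so that
# nested elements like gap(50) stay intact.
sub split_location_args {
    my ($args) = @_;
    my @parts;
    my ($depth, $current) = (0, '');

    pos($args) = 0;
    while (pos($args) < length($args)) {
        my $before = pos($args);

        if ($args =~ /\G([^(),]+)/gc) {
            $current .= $1;                  # ordinary text
        }
        elsif ($args =~ /\G\(/gc) {
            $depth++;
            $current .= '(';
        }
        elsif ($args =~ /\G\)/gc) {
            die "unbalanced ')' at offset $before\n" if --$depth < 0;
            $current .= ')';
        }
        elsif ($args =~ /\G,/gc) {
            if ($depth == 0) {               # top-level comma ends an element
                push @parts, $current;
                $current = '';
            }
            else {
                $current .= ',';
            }
        }

        # Defensive guard: every iteration must consume at least one
        # character; otherwise stop instead of looping forever.
        die "parser made no progress at offset $before\n"
            if pos($args) == $before;
    }

    die "unbalanced '(' in location string\n" if $depth != 0;
    push @parts, $current if length $current;
    return @parts;
}

# Example call with a made-up CONTIG-style argument list.
my @elements = split_location_args(
    'NT_1.1:1..100,gap(50),complement(NT_2.1:1..200)');
print "$_\n" for @elements;
```

The guard is cheap, and it turns any input the loop cannot handle into an
error message instead of a hang.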

Thanks very much for any information or help!

-Mike

P.S.  I've experimented with my local copy of FTLocationFactory, and the
modified version seems to work and parse the gaps.  I'm not sure whether I've
introduced other bugs or whether it works in all cases, but at least the parser
no longer dies.  I also don't know whether my hacky code is appropriate for
contributing back to BioPerl, but I'm happy to provide it if someone wants to
review it and/or consider it for check-in.





