[Biopython-dev] Fwd: [Biopython] skipping a bad record read in SeqIO

Sun Jun 7 21:30:50 UTC 2009

On 6/7/09, Iddo Friedberg <idoerg at gmail.com> wrote:
> On Sun, Jun 7, 2009 at 1:10 PM, Peter
> <biopython at maubp.freeserve.co.uk> wrote:
> >
> > Could you report a bug with this particular GenBank file (or at least,
> > the entry). I think Biopython should try and cope with all valid
> > GenBank files.
> >
> > It has been a long time since I personally found a GenBank file
> > Biopython couldn't parse - the only cases I can remember recently from
> > the mailing list have been invalid files from 3rd party scripts or
tools.
> >
> > Sometimes for out of spec files issuing a warning but continuing may
> > be OK (we do already this on some LOCUS line variants, e.g. some
> > GenBank files output from EMBOSS), but for anything unexpected I
> > think the only safe option is to raise an exception.
> >
> >
> > >  Biopython cannot handle every record format variant (==error) out
> > >  there, and we should probably have a method for skipping over
> > >  illegible records. The records skipped should be noted, of course,
> > >  e.g. by writing to stderr. If the record cannot be read, then the
> > >  preceding record ID and / or the record serial number should be
> > >  written.
> > >
> > >  Does that sound like something we should be doing?
> >
> > No, not really.
> >
> > I'm not 100% sure this is what you meant, but I would oppose any
> > suggestion that the default behaviour should be to completely skip bad
> > records (with only a warning or output to stderr to signal this).
> >
> > In some cases (e.g. GenBank and SwissProt files) the start and end of
> > records are well defined, so for a corrupt record we may be able to
> > recover by issuing a warning and skipping ahead to the next record
> > boundary. In other file formats this could be impossible (or at least,
> > risky). So as a general policy for Bio.SeqIO, I don't think we can
> > offer any way to skip bad records.
> >
> > Perhaps I am biased as most GenBank files I personally use are single
> > records (i.e. genomes).
>
> No, I am not suggesting that it should be the default behavior,

OK, good. I was worried there.

> but that an argument (skip_bad_records=True or somesuch) could be
> passed to the parser to make this possible for users who would like to
> do that. I work with millions of sequences at a time, and if 5,000 or
> 50,000 are badly formatted (or problematic due to a parser bug), I
> would rather make a note of it and move on, coming back later to fix
> the problem. The alternative would be -- well, and ugly hack, which
> will cause loss of time and research momentum.
>
> Also, I am not suggesting an exact implementation (yet). Warnings
> do sound better than stderr.
>
> There are a few million genbank (the format) files out there that did not
> originate with NCBI genbank (the database). Mostly in metagenomics.
> Some are meta-file that contain no sequence but only LOCUS fields.
>
> It used to be that any format was strictly adhered to, simply because
> files in that format would always originate from the same source, and
> FASTA was the universal format used for exchange, since it is very
> hard to mess up a fasta format. That is not the case any more. For
> that reason I think we should consider how to handle unparse-able
> records.

OK, clearly you have a rather different use case to me, where almost
all the GenBank files I have used are from the NCBI, and if not are
usually single genomes (with draft annotations) where I am prepared
to fix any file format errors by hand.

If you haven't already done so I would urge you to report bad files
to the upstream source, other such errors are only perpetuated
and will cause more headaches in future.

You haven't convinced me that we need a general mechanism
(in Bio.SeqIO) for skipping bad records in any file format (and I
remain sceptical that this is even possible in general). However,
for your GenBank situation I can understand your motivation now.

I think in your situation I'd implement my earlier "hand waving"
suggestion of a pre-parser which breaks the big GenBank file
up into individual records, and turn each into a StringIO handle
passed to Bio.SeqIO insider a try/except. This would make a
nice cookbook recipe... I can picture the code in my head and
could probably get it working pretty quickly if you'd like to try
this. But probably tomorrow not tonight ;)

So, perhaps an option (GenBank/EMBL specific initially) could
be considered, but so far this seems like a corner use case to
me, which we shouldn't complicate the main code base to
accommodate.

Peter

-- 
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org