[Biopython-dev] skipping a bad record read in SeqIO

Sun Jun 7 21:17:48 UTC 2009

Here is the stack dump, coming from the file:

ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz

The offender:

ACCESSION   CH991540 ABGB01000000

Syntax error at or near `Tokens('close_paren')' token
Traceback (most recent call last):
  File "./filter_seqs.py", line 108, in <module>
    matching_seqs, non_matching_seqs = filter_sequences(open(inpath),
match_pairs, condition,seq_format)
  File "./filter_seqs.py", line 23, in filter_sequences
    for seq_record in SeqIO.parse(in_handle,format):
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 420,
in parse_records
    record = self.parse(handle)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 403,
in parse
    if self.feed(handle, consumer) :
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 381,
in feed
    self._feed_misc_lines(consumer, misc_lines)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 1138,
in _feed_misc_lines
    consumer.contig_location(contig_location)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 987,
in contig_location
    self.location(content)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 684,
in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError:
join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303))

On Sun, Jun 7, 2009 at 2:14 PM, Iddo Friedberg <idoerg at gmail.com> wrote:

>
>
>
>
>
>
> On Sun, Jun 7, 2009 at 1:10 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:
>
>> On 6/7/09, Iddo Friedberg <idoerg at gmail.com> wrote:
>> > Thanks Peter.
>> >
>> >  OK, it's a genbank file, but the point is not hacking around that
>> problem
>> >  (which I did), it's more of a biopython policy question.
>>
>> Could you report a bug with this particular GenBank file (or at least, the
>> entry). I think Biopython should try and cope with all valid GenBank
>> files.
>>
>> It has been a long time since I personally found a GenBank file
>> Biopython couldn't parse - the only cases I can remember recently from
>> the mailing list have been invalid files from 3rd party scripts or tools.
>>
>> Sometimes for out of spec files issuing a warning but continuing may
>> be OK (we do already this on some LOCUS line variants, e.g. some
>> GenBank files output from EMBOSS), but for anything unexpected I
>> think the only safe option is to raise an exception.
>>
>> >  Biopython cannot handle every record format variant (==error) out
>> there,
>> >  and we should probably have a method for skipping over illegible
>> records.
>> >  The records skipped should be noted, of course, e.g. by writing to
>> stderr.
>> >  If the record cannot be read, then the preceding record ID and / or the
>> >  record serial number should be written.
>> >
>> >  Does that sound like something we should be doing?
>>
>> No, not really.
>>
>> I'm not 100% sure this is what you meant, but I would oppose any
>> suggestion that the default behaviour should be to completely skip bad
>> records (with only a warning or output to stderr to signal this).
>>
>> In some cases (e.g. GenBank and SwissProt files) the start and end of
>> records are well defined, so for a corrupt record we may be able to
>> recover by issuing a warning and skipping ahead to the next record
>> boundary. In other file formats this could be impossible (or at least,
>> risky). So as a general policy for Bio.SeqIO, I don't think we can
>> offer any way to skip bad records.
>>
>> Perhaps I am biased as most GenBank files I personally use are single
>> records (i.e. genomes).
>
>
> No, I am not suggesting that it should be the default behavior, but that an
> argument (skip_bad_records=True or somesuch) could be passed to the parser
> to make this possible for users who would like to do that. I work with
> millions of sequences at a time, and if 5,000 or 50,000 are badly formatted
> (or problematic due to a parser bug), I would rather make a note of it and
> move on, coming back later to fix the problem. The alternative would be --
> well, and ugly hack, which will cause loss of time and research momentum.
>
> Also, I am not suggesting an exact implementation (yet). Warnings do sound
> better than stderr.
>
> There are a few million genbank (the format) files out there that did not
> originate with NCBI genbank (the database). Mostly in metagenomics. Some are
> meta-file that contain no sequence but only LOCUS fields.
>
> It used to be that any format was strictly adhered to, simply because files
> in that format would always originate from the same source, and FASTA was
> the universal format used for exchange, since it is very hard to mess up a
> fasta format. That is not the case any more. For that reason I think we
> should consider how to handle unparse-able records.
>
>
>
>
>
>>
>>
>> Peter
>>
>> P.S. I would use the warnings module rather than writing to stderr, as
>> this would allow the user to filter warnings, upgrade them to
>> exceptions etc.
>>
>
>
>
> --
> Iddo Friedberg, Ph.D.
> Atkinson Hall, mail code 0446
> University of California, San Diego
> 9500 Gilman Drive
> La Jolla, CA 92093-0446, USA
> T: +1 (858) 534-0570
> http://iddo-friedberg.org
>
>
>
>
> --
> Iddo Friedberg, Ph.D.
> Atkinson Hall, mail code 0446
> University of California, San Diego
> 9500 Gilman Drive
> La Jolla, CA 92093-0446, USA
> T: +1 (858) 534-0570
> http://iddo-friedberg.org
>
>

-- 
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org