[Biopython-dev] Bio.GenBank (was: Bio.File)

Sun Sep 11 03:22:15 UTC 2011

Hi all,

There are several issues here.
Let's talk about Bio.GenBank first.

I think it's OK to have a module Bio.GenBank in addition to Bio.SeqIO, but it's a bit unclear to me which code in Bio.GenBank is still relevant and which (if any) can potentially be deprecated. Also we'd need some documentation for Bio.GenBank. In particular it's not clear to me which classes in Bio.GenBank are intended to be used by users. The description at the top of Bio.GenBank says that only Bio.GenBank.RecordParser should be used directly. However, in the test code in Bio.Graphics.GenomeDiagram (after "if __name__ == '__main__':") Bio.GenBank.FeatureParser is used. Should that be replaced by Bio.SeqIO then?

Also I think that the RecordParser should raise an Exception if it cannot find a record when parsing. Compare the following:

>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> handle = StringIO("no record here")
>>> SeqIO.read(handle, 'fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
    raise ValueError("No records found in handle")
ValueError: No records found in handle
>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> handle = StringIO("no record here")
>>> parser.parse(handle)
>>> # no error raised

Raising an exception in this case would still let us ignore header text before the actual start of a GenBank record; the error would only be raised if no GenBank record can be found anywhere in the input.
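
As a rough sketch of the behaviour I have in mind (plain Python, not the actual Bio.GenBank code; the function name find_first_locus and the sample LOCUS line are made up for illustration), the scanner skips any free text and only complains once it reaches the end of the input without seeing a LOCUS line:

```python
from io import StringIO


def find_first_locus(handle):
    """Skip free-text header lines and return the first LOCUS line.

    Hypothetical helper for illustration only; not part of Bio.GenBank.
    """
    for line in handle:
        if line.startswith("LOCUS"):
            return line.rstrip("\n")
    # End of input reached without finding the start of any record.
    raise ValueError("No GenBank record found in handle")


# Header text before the record is skipped silently.
good = StringIO("Free-text header\n\nLOCUS       TEST0001   10 bp    DNA\n")
locus_line = find_first_locus(good)

# Input without any LOCUS line is reported as an error.
try:
    find_first_locus(StringIO("no record here"))
    error_message = None
except ValueError as err:
    error_message = str(err)
```

A real parser would of course go on to read the rest of the record; the point is only that "skip the header" and "raise if nothing was found" are compatible.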

Best,
--Michiel.

--- On Thu, 9/8/11, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> From: Peter Cock <p.j.a.cock at googlemail.com>
> Subject: Re: [Biopython-dev] Bio.File
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: biopython-dev at biopython.org
> Date: Thursday, September 8, 2011, 11:25 AM
> On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> >
> > No, we shouldn't rely on an HTTP return code. The idea is that only
> > the parser can know if the output returned by NCBI is valid, as in:
> >
> > handle = Entrez.efetch(...something...)
> > try:
> >    record = Entrez.read(handle)
> > except Exception:
> >    # NCBI returned something invalid, or at least
> >    # something that we don't know how to parse
> 
> In theory, yes, but quite often parsers look for certain
> patterns and if you feed them something else they may
> just say "no data". For example, the GenBank parser
> ignores anything before the LOCUS line (in order to
> cope with the free text header in the large multi-record
> files on the NCBI FTP site). As a side effect, you can
> give it almost any plain text file and the parser won't
> raise an error - it will just say no GenBank records
> found.
> 
> >> If the server could be relied on to always give an
> >> HTTP error code this wouldn't be needed:
> >>
> >> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
> >>
> >
> > I don't like this approach much, as it depends on exactly
> > what the error message looks like, and misses any other
> > problems, such as incomplete output. There will be a
> > certain false positive rate, with return values that pass
> > the checking of the first 10 lines but are still unusable.
> 
> Yes, in theory the server should detect and handle
> errors nicely - but there are sometimes bugs in web-
> services. Certainly from memory I have had HTTP
> return code 200 (OK) with invalid data from both the
> NCBI and TogoWS.
> 
> > Even worse, the false positive rate can suddenly go up
> > if the server maintainers decide to change anything in
> > their error messages.
> 
> The checks are deliberately designed to avoid false
> positives - at the cost of missing some errors early.
> 
> > This kind of checking should be
> > done by the parser, which can tell you exactly if the
> > data are valid, or if not, what is wrong with it.
> 
> That isn't always possible, since so many bioinformatics
> file formats are so vague that validation is hard.
> 
> I accept checking the first 10 lines for common errors
> specific to that webservice is inelegant, but it is
> practical.
> 
> [Some of those TogoWS checks are probably superfluous
> right now, I'm still polishing the error handling - some of
> which will rely on TogoWS itself catching more conditions]
> 
> Regards,
> 
> Peter
>