[Biopython-dev] Parsing malformed genbank files (e.g. VectorNTI)

Peter biopython at maubp.freeserve.co.uk
Thu Dec 4 10:26:39 UTC 2008


On Wed, Dec 3, 2008 at 12:19 AM, Timothy Ham <timothyham at gmail.com> wrote:
>
> Hi everyone,
>
> The current Biopython GenBank parser dies while parsing VectorNTI
> generated files.  For example, until recently, Biopython did not
> accept an empty SOURCE field. It still does not handle an empty
> VERSION or ACCESSION field (consumer.data.id never gets filled),
> which is the default for user-generated vector maps via VectorNTI.

I fixed the SOURCE issue in Bio/GenBank/__init__.py CVS revision 1.97
after Tim contacted me off-list - there was no bug report.

> Now, it is easy enough to change the GenBank parser to handle
> malformed GenBank files (I can submit patches), but the real question
> becomes:
>     Should Biopython handle malformed GenBank files at all?
> I would like to be practical and say yes, since VectorNTI is a very
> common, widely used format, but I wanted to ask the community before
> submitting my patches.
>
> Thanks for the great work,
> Tim

As I'm the de facto maintainer for Bio.GenBank, I guess this is my
call unless the list as a whole reaches a consensus.

Reading the GenBank file format spec, the ACCESSION and VERSION lines
are clearly intended to be mandatory.  Note that for mandatory fields,
IIRC, the NCBI will use a single dot/period as a placeholder when
there is no data.  So I would argue that VectorNTI is producing
invalid files, and you should write to the authors and encourage them
to follow the spec more closely (even if we do change Biopython to
cope).
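
To make the point about empty mandatory fields concrete, here is a
minimal sketch in plain Python (it is not Biopython's parser code).
It scans the header of a GenBank-style file and reports keywords that
are present but carry no value.  The function name, the choice of
keywords, and the sample header lines are all assumptions made for
illustration; a lone "." counts as filled, following the placeholder
convention described above.

```python
# Hypothetical helper, not part of Biopython: report header keywords
# that appear with no value at all.
MANDATORY_KEYWORDS = ("ACCESSION", "VERSION")  # assumed for this sketch


def find_empty_header_fields(lines, keywords=MANDATORY_KEYWORDS):
    """Return the keywords whose header line has no value after it."""
    empty = []
    for line in lines:
        # The header ends where the feature table or sequence begins.
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        # Header keywords start in column one; skip continuation lines.
        if not line or line[0].isspace():
            continue
        parts = line.split(None, 1)
        if parts[0] in keywords:
            value = parts[1].strip() if len(parts) > 1 else ""
            if not value:
                empty.append(parts[0])
    return empty


# Made-up header lines in the style described in this thread.
sample_header = [
    "LOCUS       pExample      100 bp    DNA   circular",
    "ACCESSION",
    "VERSION     .",
    "SOURCE",
    "FEATURES             Location/Qualifiers",
]
print(find_empty_header_fields(sample_header))
```

Here only the bare ACCESSION line would be reported, since the
VERSION line carries the "." placeholder.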

However, I'm willing to bend a little on out-of-spec GenBank files (in
cases like this where there is no ambiguity about the parsing), but I
would want a real example output file from VectorNTI to include in a
unit test.  This is important as we need to use something sensible for
the SeqRecord's id property if the ACCESSION and VERSION are missing.
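
One possible fallback policy for the id, sketched in plain Python
under stated assumptions: prefer VERSION, then ACCESSION, then the
LOCUS name, and only then a fixed placeholder string.  The function
name, the ordering, and the placeholder text are hypothetical and are
not what Bio.GenBank currently does; the right choice would need to be
checked against a real VectorNTI file.

```python
# Hypothetical fallback policy, not Biopython's actual behaviour.
def choose_record_id(version, accession, locus_name,
                     placeholder="<unknown id>"):
    """Return the first usable identifier among the header fields."""
    for candidate in (version, accession, locus_name):
        if candidate is None:
            continue
        candidate = candidate.strip()
        # Treat an empty value or the "." placeholder as missing.
        if candidate and candidate != ".":
            return candidate
    return placeholder


# Empty VERSION and ACCESSION, as described for VectorNTI output.
print(choose_record_id("", "", "pExample"))
```

Falling back to the LOCUS name keeps the id meaningful to the user,
at the cost of it no longer being a database accession.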

Peter


