[Biopython-dev] Parsing malformed genbank files (e.g. VectorNTI)

Peter biopython at maubp.freeserve.co.uk
Thu Dec 4 15:16:20 UTC 2008


On Thu, Dec 4, 2008 at 3:02 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> Peter wrote:
>> Reading the GenBank file format spec, the ACCESSION and VERSION lines
>> are clearly intended to be mandatory.  Note that for mandatory fields,
>> IIRC, the NCBI will use a single dot/period as a place holder when
>> there is no data.  So I would argue that VectorNTI is producing
>> invalid files, and you should write to the authors and encourage them
>> to follow the spec more closely (even if we do change Biopython to
>> cope).

Bruce wrote:
> At http://www.ncbi.nlm.nih.gov/Genbank/index.html there is a link to the
> 'complete release notes for the current version of GenBank'.
> From ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt, it clearly states that
> ACCESSION and VERSION are mandatory  ...

We agree on this: according to the current NCBI standard, a GenBank
file missing the ACCESSION or VERSION line is technically invalid.

Bruce:
> If these entries are missing then Biopython must raise an exception because
> the GenBank file is invalid.

I see a difference between a GenBank parser and a GenBank validator.
While it would be nice to just say "your file is invalid", in many
cases the meaning of the file is unambiguous and it can still be safely
parsed.  From past experience, even the NCBI sometimes provides invalid
files which break its own rules (e.g. Biopython Bug 2591).  In my
personal opinion, a strict parser which rejects every invalid GenBank
file isn't actually that useful - there is a grey area where a little
leniency is very helpful:

Peter wrote:
>> However, I'm willing to bend a little on out of spec GenBank files (in
>> cases like this where there is no ambiguity about the parsing), but I
>> would want a real example output file from VectorNTI to include for a
>> unit test.  This is important as we need to use something sensible for
>> the SeqRecord's id property if the ACCESSION and VERSION are missing.
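
To make the fallback question concrete, here is a minimal sketch of one
possible policy.  It is not Biopython's actual parser: the function name
choose_record_id, the strict flag, the "<unknown id>" placeholder and the
sample header lines are all hypothetical, and the LOCUS fallback is just an
assumption about what "something sensible" could mean.  It prefers VERSION,
then ACCESSION, then the LOCUS name, and treats a lone "." as "no data":

```python
def choose_record_id(header_lines, strict=False):
    """Pick an identifier from GenBank header lines (illustrative only).

    Preference order: VERSION, then ACCESSION, then the LOCUS name.
    A lone "." is treated as the placeholder for "no data".
    With strict=True the function acts like a validator and raises
    ValueError if ACCESSION or VERSION is absent.
    """
    fields = {}
    for line in header_lines:
        parts = line.split()
        # Top-level keywords start in column 1; keep the first value seen.
        if len(parts) >= 2 and not line[0].isspace():
            fields.setdefault(parts[0], parts[1])

    missing = [key for key in ("ACCESSION", "VERSION") if key not in fields]
    if strict and missing:
        raise ValueError("missing mandatory line(s): " + ", ".join(missing))

    for key in ("VERSION", "ACCESSION", "LOCUS"):
        value = fields.get(key, ".")
        if value != ".":
            return value
    return "<unknown id>"  # hypothetical placeholder when nothing is usable


# Made-up header with no ACCESSION or VERSION line (not real VectorNTI output).
incomplete_header = [
    "LOCUS       pExample     5000 bp    DNA     circular",
    "DEFINITION  hypothetical example record",
]
print(choose_record_id(incomplete_header))
```

With strict=True the same input raises ValueError, which is the validator
behaviour; the default is the lenient parser behaviour.  Whether the LOCUS
name is really the right fallback is exactly what a real example file would
help decide.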

Peter


