[BioPython] Genbank LOCUS line slightly misaligned

Thu Dec 18 15:15:07 UTC 2008

On Thu, Dec 18, 2008 at 1:47 PM, Peter Saffrey <pzs at dcs.gla.ac.uk> wrote:
> I have a genbank file sent to my lab from a company called Genomatrix. It is
> slightly misformed.

Oh dear.  Parsing malformed files is difficult because they can often
be interpreted in more than one way.  In general, the only safe and
explicit choice here is to throw an exception, although we do
tolerate some minor deviations from the spec in places.

> Specifically, the LOCUS lines have the right features, but not quite
> aligned; for example, the "bp" marker is not always at exactly the positions
> ([29:33] and [40:44]) required by _feed_first_line() in
> $biopythonhome/Genbank/Scanner.py.

The fact that we allow for the "bp" (or "aa") marker in two places
reflects two iterations of the GenBank standard.  In theory we could
remove the support for the older version, but there may be third-party
tools still producing GenBank files using that style.
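
As a rough illustration of that check (a sketch only, not the actual
code in Scanner.py; the function name marker_column is made up here),
the two fixed slices mentioned above could be tested like this:

```python
def marker_column(line):
    """Return the offset (29 or 40) where a ' bp ' or ' aa ' marker sits.

    Hypothetical helper for illustration. Returns None when the marker
    is at neither of the two fixed positions, which is the situation
    described for the misaligned files in this thread.
    """
    markers = (" bp ", " aa ")
    for start in (29, 40):
        # Each layout reserves exactly four columns for the marker.
        if line[start:start + 4] in markers:
            return start
    return None
```

A line shifted by even a single column returns None here, which is why
a strictly column-based parser gives up on it.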

> Have Genomatrix made an error in producing these genbank files, or should
> the Biopython routines accommodate these variations?

I presume Genomatrix have made an error - try emailing them for
clarification.  The GenBank file format for the LOCUS line is very
explicit and uses very precise column positions for the fields.

In theory we could try parsing ambiguous files using spaces to split
up the fields, but as many of the fields are optional, this isn't
generally possible without a little guesswork.
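
To show where the guesswork comes in, here is a small sketch.  The two
LOCUS lines are invented for illustration (not taken from any real
file), and it assumes the topology field is the one left blank:

```python
def split_locus(line):
    """Split a LOCUS line on whitespace, dropping the LOCUS keyword."""
    return line.split()[1:]

# Invented example lines; the second one omits the topology field.
full = "LOCUS  EXAMPLE1  1200 bp  DNA  linear  PLN  01-JAN-2000"
sparse = "LOCUS  EXAMPLE2  1200 bp  DNA  PLN  01-JAN-2000"

# The same token index holds different fields in the two lines, so the
# position in the split list alone does not say which field a token is.
topology_or_division = (split_locus(full)[4], split_locus(sparse)[4])
```

With fixed columns a blank field is simply blank; after splitting on
spaces, the parser has to guess from each token's content what it is.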

> Some lines just give warnings and plough on, but others report that
> there isn't a space in exactly the right place and fail to read the record
> at all. I'm having to hack the genbank file as we speak...

I suspect that they (Genomatrix) are inserting a large locus
identifier into the beginning of the LOCUS line which is sometimes
bigger than the allocated slot, pushing the rest of the fields out of
position in some of the files.  I'd need to see several examples to be
confident about this guess.

If you don't actually need much information from the LOCUS line, you
might find it easier to hack our parser to be a little more tolerant -
I would suggest simply pulling out the locus ID, ignoring the rest of
the LOCUS line, and printing a warning.
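
Something along these lines is what I have in mind.  This is a
standalone sketch, not a patch against Scanner.py, and the function
name tolerant_locus_id is hypothetical:

```python
import warnings

def tolerant_locus_id(line):
    """Return only the locus ID from a LOCUS line, ignoring the rest.

    Hypothetical helper: takes the first whitespace-delimited token
    after the LOCUS keyword and warns that everything else is skipped.
    """
    if not line.startswith("LOCUS"):
        raise ValueError("Not a LOCUS line: %r" % line)
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("LOCUS line has no identifier: %r" % line)
    locus_id = fields[1]
    warnings.warn("Ignoring the rest of the LOCUS line after ID %r" % locus_id)
    return locus_id
```

This assumes the identifier itself contains no spaces, and you would
lose the sequence length, molecule type and date from the LOCUS line.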

Peter

P.S. Which version of Biopython are you using?  Biopython 1.48 onwards
is a little less fussy than Biopython 1.47 in order to accept GenBank
files produced by EMBOSS seqret.