[BioPython] Bug in Bio.GenBank.index_file()

Jan T. Kim kim@inb.mu-luebeck.de
Tue, 15 May 2001 17:51:28 +0200


Hi all,

We have come across a bug in the Bio.GenBank.index_file() function:
large GenBank files (larger than 100000 bytes, see below) are not
indexed correctly. As a consequence, attempts to access records return
essentially arbitrary chunks of the file instead of the requested
records. Needless to say, parsers frequently choke on this kind of
input.

The problem is that Bio.GenBank.index_file() directly accesses the
positions member of a Martel.RecordReader, apparently expecting to find
the file positions of record starts there. This assumption is not
warranted: the positions list holds the coordinates of record starts
within chunks read from the file, not file positions. The two coincide
only for the first chunk read from a file. The chunk size is controlled
by the Martel.SIZEHINT variable, which is set to 100000, so once a file
exceeds this size, file coordinates and chunk coordinates diverge.
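
The divergence can be reproduced without Martel. The following is only
a minimal sketch under stated assumptions: chunked_record_starts() is a
hypothetical stand-in for a chunked record reader (it is not Martel's
code), records are assumed to start with a "LOCUS" line, and the size
hint is shrunk to 30 bytes so that a tiny test input spans several
chunks. It contrasts an index built from chunk-relative positions with
one that adds the file offset of each chunk.

```python
import io


def chunked_record_starts(handle, sizehint):
    """Hypothetical chunked reader: yield (chunk_start, positions).

    chunk_start is the file offset of the chunk; positions are the
    offsets of record starts relative to the start of that chunk.
    """
    chunk_start = 0
    while True:
        lines = handle.readlines(sizehint)  # whole lines, roughly sizehint bytes
        if not lines:
            break
        positions = []
        pos = 0
        for line in lines:
            if line.startswith(b"LOCUS"):
                positions.append(pos)
            pos += len(line)
        yield chunk_start, positions
        chunk_start += pos


def naive_index(handle, sizehint):
    """Treat chunk-relative positions as file positions (the bug)."""
    return [p for _, positions in chunked_record_starts(handle, sizehint)
            for p in positions]


def corrected_index(handle, sizehint):
    """Add each chunk's file offset to its chunk-relative positions."""
    return [start + p
            for start, positions in chunked_record_starts(handle, sizehint)
            for p in positions]


# Four tiny fake records of 27 bytes each.
data = b"".join(b"LOCUS       REC%d\nORIGIN\n//\n" % i for i in range(4))

naive = naive_index(io.BytesIO(data), sizehint=30)
correct = corrected_index(io.BytesIO(data), sizehint=30)

print("naive:  ", naive)
print("correct:", correct)
```

Only the corrected offsets are guaranteed to point at a "LOCUS" line
when used to seek in the file. The naive list agrees with them only
for records that start in the first chunk.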

I would offer a fix for the bug, but I am not sure how best to do this.
Directly accessing the positions list of an instance of a class
belonging to another module is a hack that bypasses modularity. The fact
that this hack was committed probably indicates the need for an
additional interface on the Martel.RecordReader class, i.e. something
like a current_filepos() method that provides the information
Bio.GenBank.index_file() needs. While this should be fairly easy
to write for Unix & Co. alone, I'm concerned that newline substitutions
performed on other operating systems might become an additional source
of discrepancies between chunk and file coordinates.
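
The newline concern can be illustrated in a few lines. This is a rough
sketch, not a statement about how Martel or Bio.GenBank open files: it
assumes a file with "\r\n" line endings and uses Python's
io.TextIOWrapper with newline=None to mimic text-mode newline
translation. The helper names text_offset() and byte_offset() are made
up for the illustration.

```python
import io

# A tiny two-record file with DOS-style line endings.
raw = b"LOCUS A\r\n//\r\nLOCUS B\r\n//\r\n"


def text_offset(raw_bytes, marker):
    """Offset of marker in the newline-translated text ("\r\n" -> "\n")."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw_bytes),
                               encoding="ascii", newline=None)
    return wrapper.read().index(marker)


def byte_offset(raw_bytes, marker):
    """Offset of marker in the untranslated bytes, as stored on disk."""
    return raw_bytes.index(marker.encode("ascii"))


translated = text_offset(raw, "LOCUS B")
on_disk = byte_offset(raw, "LOCUS B")

print("offset in translated text:", translated)
print("offset in raw bytes:      ", on_disk)
```

Each translated line ending shifts the two offsets apart by one, so
positions counted in translated text do not land on record starts when
used to seek in the raw file. Reading in binary mode sidesteps this,
at the cost of handling line endings explicitly.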

I'd be grateful for any comments on this.

Greetinx, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |  *NEW* -->  email: kim@inb.mu-luebeck.de                           |
 |  *NEW* -->  WWW:   http://www.inb.mu-luebeck.de/staff/kim.html     |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*