[Biopython-dev] [Bug 1680] Problems with the GenBank indexing

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Sat Dec 24 07:18:47 EST 2005


http://bugzilla.open-bio.org/show_bug.cgi?id=1680





------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2005-12-24 07:18 -------
I think (after following references through several files) that we need to
focus on Bio/expressions/genbank.py

The "record" definition appears to allow multiple trailing blank lines at the
end of a record (see "record_end"), i.e. it looks for // and then one or more
newlines.

However, the "format" definition which appears to be used to build the index is
this:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.EndsWith, ("//",))

If I am not mistaken, for files with blank lines between records (as
reported in this bug), this will lead to the first record having no trailing
blank lines, and each subsequent record having leading blank lines.
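
To illustrate the suspected behaviour, here is a minimal pure-Python sketch.
It does not use Martel; split_ends_with is a hypothetical stand-in that assumes
RecordReader.EndsWith closes a chunk immediately after a line starting with
"//", which I have not verified against the Martel source.

```python
def split_ends_with(text, terminator="//"):
    """Hypothetical stand-in for an 'ends with' record splitter.

    Assumption: a chunk is closed right after any line that starts
    with the terminator. This is not Martel's actual implementation.
    """
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.startswith(terminator):
            chunks.append("".join(current))
            current = []
    if current:
        # Anything left after the last terminator becomes a final chunk.
        chunks.append("".join(current))
    return chunks


# Two toy records separated by one blank line, as in the reported files.
sample = (
    "LOCUS       AAA\n"
    "//\n"
    "\n"
    "LOCUS       BBB\n"
    "//\n"
)

for chunk in split_ends_with(sample):
    print(repr(chunk))
```

Under that assumption the blank line is attached to the front of the second
chunk, so that chunk no longer begins with the LOCUS line.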

So, my suggestions are:

(a) Allow blank lines at the start of a GenBank record (before the LOCUS line)

Or:

(b) Change the "format" definition to split records at the LOCUS line instead:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.StartsWith, ("LOCUS ",))


Making this change seems to fix this bug: indexing the small 6 KB GenBank file
with three entries takes under a second.

As the GenBank.Iterator code works by looking for records that start with
LOCUS, this seems like a more consistent approach.
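
A rough sketch of why splitting at the LOCUS line copes with blank lines
between records. Again this is plain Python rather than Martel, and
split_starts_with is a hypothetical helper that only approximates what I
assume RecordReader.StartsWith does.

```python
def split_starts_with(text, marker="LOCUS "):
    """Hypothetical stand-in for a 'starts with' record splitter.

    Assumption: a new chunk begins at every line that starts with the
    marker. This is not Martel's actual implementation.
    """
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(marker) and current:
            # A new record is starting, so close the previous chunk.
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


# Two toy records separated by one blank line.
sample = (
    "LOCUS       AAA\n"
    "//\n"
    "\n"
    "LOCUS       BBB\n"
    "//\n"
)

for chunk in split_starts_with(sample):
    print(repr(chunk))
```

Here the blank line ends up trailing the first chunk, which is the case the
"record" definition already appears to accept, and every chunk begins with
LOCUS.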

NOTE - I have not run the full test suite to look for any side effects.





