[BioPython] Bug in Bio.GenBank.index_file()
Brad Chapman
chapmanb@arches.uga.edu
Tue, 15 May 2001 21:15:47 -0400
Hi Jan;
Jan:
> >we have come across a bug in the Bio.GenBank.index_file() function.
Yes, I have also been fighting with this for a while. I know the way I
did things was an ugly hack, and it was also failing for me on large
files. I hadn't been able to figure out why I couldn't reproduce it
with a smaller file, but now:
Jan:
> >The problem is that Bio.GenBank.index_file() directly accesses the
> >positions member of a Martel.RecordReader, apparently assuming to find
> >file positions of record starts there.
Andrew:
> Ahhh. I understand where the confusion might have arisen. I wanted the
> Martel code to be as lean as possible so I didn't keep track of
> positions on the assumption that downstream code could keep a running
> sum of the characters(). But as you point out, it *appears* to keep
> track of record positions - which only fails when there is more than
> SIZEHINT data - so especially given the lack of appropriate documentation,
> people may assume that position is valid.
Ah ha! Thanks, guys, for the clear explanation. Yeah, I was just plain
doing things wrong here, but as Andrew explains, there probably isn't
a way to exactly do this as things are right now.
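
As a rough illustration of the "running sum of the characters()" idea
Andrew describes, here is a minimal sketch of a SAX handler that tracks
record start offsets itself instead of reading them from the reader. It
assumes, as described above, that every character of the input file is
delivered through characters(). The class name, the "record" tag name
and the simulate() driver are hypothetical and only stand in for a real
Martel parser; this is not how Mindy or index_file actually do it.

```python
from xml.sax import handler


class RecordOffsetHandler(handler.ContentHandler):
    """Record the character offset at which each record element starts."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag  # hypothetical tag that wraps one record
        self.offset = 0               # running sum of characters seen so far
        self.record_starts = []       # offset of the start of each record

    def startElement(self, name, attrs):
        # The running total at this point is where the record begins.
        if name == self.record_tag:
            self.record_starts.append(self.offset)

    def characters(self, content):
        # Character counts equal byte offsets only for single-byte encodings.
        self.offset += len(content)


def simulate(handler_obj, records, record_tag):
    """Stand-in for a parser: emit one element per record, text included."""
    for text in records:
        handler_obj.startElement(record_tag, {})
        handler_obj.characters(text)
        handler_obj.endElement(record_tag)


offsets = RecordOffsetHandler("record")
simulate(offsets, ["LOCUS A\n//\n", "LOCUS BB\n//\n"], "record")
print(offsets.record_starts)
```

Because the handler keeps its own total, it does not depend on how much
data the reader buffers at a time.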
So... I think a good solution may be to start working to switch over
to the indexing capabilities offered by Mindy, the indexer that Andrew
wrote that uses Martel. Last week, when I was (fruitlessly) working on
this bug, I took a look at how Andrew keeps track of positions in
Mindy, and this is all done through callbacks. For me to do it this
way, and also offer the possibility of indexing files by elements
other than the accession number, I'd end up rewriting Mindy before I
was done, which isn't a good thing :-).
I have already written indexing functions that use Mindy. I have been
using these in my own work for a while, and they work quite stably,
even for large files that index_file couldn't handle before. I have
been happily using it on the entire Arabidopsis genome without
problems.
The code for this is available in CVS, and it works almost like
index_file and Dictionary, except that you use:
index_file_db(file, name_of_database, directory_to_put_database_in)
and
MindyDictionary(name_of_database, directory_database_is_in, parser)
Using this requires mindy, available from:
http://www.biopython.org/~dalke/mindy-0.1.tar.gz
To use it, you'll need to put this somewhere on your PYTHONPATH in a
directory named 'mindy', and also apply the attached small patch to
mindy_index.py. The patch records file sizes so that already indexed
files that haven't changed don't have to be reindexed (to read more
about this, check out the development list from March:
http://www.biopython.org/pipermail/biopython-dev/2001-March/000320.html).
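
To give an idea of how stored file sizes can be used to skip
reindexing, here is a minimal sketch. The attached patch only stores
the sizes; the comparison step and the function names
record_file_sizes() and needs_reindex() are assumptions made for
illustration and are not part of mindy.

```python
import os


def record_file_sizes(filenames):
    """Return a {filename: size in bytes} mapping, taken at index time."""
    return {name: os.path.getsize(name) for name in filenames}


def needs_reindex(filenames, stored_sizes):
    """Return True if any file is new or has changed size since indexing."""
    for name in filenames:
        if name not in stored_sizes:
            return True  # never indexed before
        if os.path.getsize(name) != stored_sizes[name]:
            return True  # size differs, so the file has changed
    return False
```

Note that a size check is only a cheap heuristic: an edit that leaves
the file the same length would go unnoticed.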
Does this seem like a good solution to people? I don't know how Andrew
feels about this, but I'd rather promote mindy than rewrite the
functionality that is already there. Jan, will this work for you?
Thanks for the bug report and discussion!
Brad
--- mindy_index.py.orig Mon Mar 19 06:15:14 2001
+++ mindy_index.py Sun Mar 25 10:26:13 2001
@@ -2,7 +2,7 @@
 See the usage for more information.
 """
-
+import os
 import sys
 from xml.sax import handler
 from bsddb3 import db, dbshelve
@@ -141,6 +141,7 @@
         self.keywords = keywords
         self.filename = None
         self._filenames = {}
+        self._file_sizes = {}
         self._abbrevs = {}
 
     def add_filename(self, filename):
@@ -153,6 +154,9 @@
         self._abbrevs[filename] = str(abbrev)
         self.mindy_data["filenames"] = self._filenames
+
+        self._file_sizes[filename] = os.path.getsize(filename)
+        self.mindy_data["file_sizes"] = self._file_sizes
 
     def use_filename(self, filename):
         if not self._abbrevs.has_key(filename):