[BioPython] Bug in Bio.GenBank.index_file()

Tue, 15 May 2001 21:15:47 -0400

--s3x0BFOTiF
Content-Type: text/plain; charset=us-ascii
Content-Description: message body text
Content-Transfer-Encoding: 7bit

Hi Jan;

Jan:
> >we have come across a bug in the Bio.GenBank.index_file() function.

Yes, I have also been fighting with this for a while. I know the way I
did things was an ugly hack, and it was also failing for me on large
files. I hadn't been able to figure out why I couldn't reproduce it
with a smaller file, but now:

Jan:
> >The problem is that Bio.GenBank.index_file() directly accesses the
> >positions member of a Martel.RecordReader, apparently assuming to find
> >file positions of record starts there.

Andrew:
> Ahhh.  I understand where the confusion might have arisen.  I wanted the
> Martel code to be as lean as possible so I didn't keep track of
> positions on the assumption that downstream code could keep a running
> sum of the characters().  But as you point out, it *appears* to keep
> track of record positions - which only fails when there is more than
> SIZEHINT data - so especially given the lack of appropriate documentation,
> people may think that position is valid.

Ah ha! Thanks, guys, for the clear explanation. Yeah, I was just plain
doing things wrong here, but as Andrew explains, there probably isn't
a way to do this exactly as things stand right now.
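
To make Andrew's "running sum of the characters()" idea concrete, here
is a minimal sketch using the standard xml.sax ContentHandler interface
rather than Martel itself. The class name OffsetTracker, the function
record_offsets and the "record" tag name are made up for illustration;
this is not code from Martel or Mindy. It assumes, as Andrew describes,
that the text passed to characters() adds up to the original file
contents, so the counts are character offsets (equal to byte offsets
only for single-byte encodings such as ASCII).

```python
import xml.sax
from xml.sax.handler import ContentHandler


class OffsetTracker(ContentHandler):
    """Remember the running character offset at each record start."""

    def __init__(self, record_tag="record"):
        super().__init__()
        self.record_tag = record_tag  # hypothetical tag name
        self.offset = 0               # running sum of characters seen
        self.starts = []              # offset at which each record begins

    def startElement(self, name, attrs):
        # A record begins wherever the running sum currently stands.
        if name == self.record_tag:
            self.starts.append(self.offset)

    def characters(self, content):
        # Text may arrive in several chunks; only the total matters.
        self.offset += len(content)


def record_offsets(xml_bytes, record_tag="record"):
    """Parse xml_bytes and return the start offset of every record."""
    handler = OffsetTracker(record_tag)
    xml.sax.parseString(xml_bytes, handler)
    return handler.starts
```

The point of the design is that the handler never asks the reader for
a position, so it does not depend on how much data the reader buffers
at a time.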

So... I think a good solution may be to start working to switch over
to the indexing capabilities offered by Mindy, the indexer that Andrew 
wrote that uses Martel. Last week, when I was (fruitlessly) working on 
this bug, I took a look at how Andrew keeps track of positions in
Mindy, and this is all done through callbacks. For me to do it this
way, and also offer the possibility of indexing files by elements
other than the accession number, I'd end up rewriting Mindy before I
was done, which isn't a good thing :-).

I have already written indexing functions that use Mindy. I have been
using these in my own work for a while, and they have been quite
stable, even for large files that index_file couldn't handle before.
I have been happily using them on the entire Arabidopsis genome
without problems.

The code for this is available in CVS, and it works almost like
index_file and Dictionary, except that you use:

index_file_db(file, name_of_database, directory_to_put_database_in)

and 

MindyDictionary(name_of_database, directory_database_is_in, parser)


Using this requires mindy, available from:

http://www.biopython.org/~dalke/mindy-0.1.tar.gz

To use it, you'll need to put this somewhere on your PYTHONPATH in a
directory named 'mindy', and also apply the attached small patch to
mindy_index.py, which adds support for skipping the reindexing of
already indexed files that haven't changed. To read more about this,
check out the development list from March:

http://www.biopython.org/pipermail/biopython-dev/2001-March/000320.html
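
The attached patch only records each file's size in the index. As a
rough sketch of how a stored size could be used to decide whether a
file needs reindexing, something like the following would do. The
function needs_reindex and the plain dictionary standing in for
mindy_data["file_sizes"] are hypothetical; the actual comparison code
is not part of the patch shown here.

```python
import os


def needs_reindex(filename, stored_sizes):
    """Return True if filename is new or its size has changed.

    stored_sizes maps filename -> size in bytes at indexing time
    (a stand-in for the "file_sizes" entry the patch stores).
    """
    recorded = stored_sizes.get(filename)
    # Unknown file, or size differs from what was recorded: reindex.
    return recorded is None or recorded != os.path.getsize(filename)
```

Note that a size check is only a cheap heuristic: an edit that leaves
the file the same length would go unnoticed.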

Does this seem like a good solution to people? I don't know how Andrew
feels about this, but I'd rather promote mindy than rewrite the
functionality that is already there. Jan, will this work for you?

Thanks for the bug report and discussion!
Brad


--s3x0BFOTiF
Content-Type: text/plain
Content-Description: Patch for mindy_index
Content-Disposition: inline;
	filename="mindy_index.diff"
Content-Transfer-Encoding: 7bit

--- mindy_index.py.orig	Mon Mar 19 06:15:14 2001
+++ mindy_index.py	Sun Mar 25 10:26:13 2001
@@ -2,7 +2,7 @@
 
 See the usage for more information.
 """
-
+import os
 import sys
 from xml.sax import handler
 from bsddb3 import db, dbshelve
@@ -141,6 +141,7 @@
         self.keywords = keywords
         self.filename = None
         self._filenames = {}
+        self._file_sizes = {}
         self._abbrevs = {}
 
     def add_filename(self, filename):
@@ -153,6 +154,9 @@
         self._abbrevs[filename] = str(abbrev)
 
         self.mindy_data["filenames"] = self._filenames
+
+        self._file_sizes[filename] = os.path.getsize(filename)
+        self.mindy_data["file_sizes"] = self._file_sizes
 
     def use_filename(self, filename):
         if not self._abbrevs.has_key(filename):

--s3x0BFOTiF--