[Biopython-dev] ANN: mindy-0.1

Sun Mar 25 12:23:44 EST 2001

Hi Andrew;

> WARNING! First attempt at a generalized indexer for bioinformatics
> formats using Martel.  All code experimental and subject to change
> with even less notice than usual!

Well, despite this friendly encouragement, I decided to take a look at 
mindy anyways :-).
 
> For the last few weeks I've been thinking about how to use Martel as
> part of a generalized database indexer.  Martel of course does all the
> required parsing, so it's a matter of converting the results into some
> indexable format.

This works great for me! I got this working with the GenBank format
and checked some experimental code into Bio.GenBank to make indexing
using mindy similar to the current indexing system. Using this
requires that you put mindy inside a mindy directory on your
PYTHONPATH and make it importable by adding an __init__.py.

Adding this allowed me to mess around with Mindy using code I already
had which used the standard indexer -- some comments on this are below.
 
> Instead I acknowledge the frequent common case where all needed fields
> are strictly contained in an element and wrote a content handler which
> lets you say something like

I have one fairly common problem with this in the context of GenBank
records. Almost all of the time I want to index GenBank records using
the accession number (without the version). The problem is that some
GenBank records look like:

LOCUS       AC006837    87584 bp    DNA             PLN       05-APR-2000
DEFINITION  Arabidopsis thaliana chromosome II section 1 of 255 of the complete
            sequence. Sequence from clones F23H14.
ACCESSION   AC006837 AE002093
VERSION     AC006837.15  GI:6598619

and have two (or more) accession numbers. I think the second one is an
old, now defunct, accession number for the same clone. When I index
with mindy using just "accession", everything gets indexed under the
second accession number, and not the first as I would like.

What do you think would be a good solution to this? Is it possible to
have multiple index keys pointing to the same record (i.e. both AC006837
and AE002093 point to this record)? Am I stuck using XSLT or
something else for this case?
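
To make the question concrete, here is a rough sketch of what I mean
by several keys pointing at one record. A plain dict stands in for the
Berkeley DB index, and parse_accessions() and index_accessions() are
names I made up for illustration; they are not part of mindy or
Martel. It also assumes the ACCESSION line does not wrap onto a
continuation line.

```python
# Hypothetical sketch: a plain dict stands in for the real index, and
# neither function below exists in mindy.

def parse_accessions(record_text):
    """Return every accession on the ACCESSION line, primary one first.

    Assumes the accessions fit on a single line (no continuation lines).
    """
    for line in record_text.splitlines():
        if line.startswith("ACCESSION"):
            return line.split()[1:]
    return []


def index_accessions(index, record_text, location):
    """Store the same record location under every accession found."""
    accessions = parse_accessions(record_text)
    for accession in accessions:
        index[accession] = location
    # The first accession listed is treated as the current one.
    return accessions[0] if accessions else None


record = (
    "LOCUS       AC006837    87584 bp    DNA             PLN       05-APR-2000\n"
    "ACCESSION   AC006837 AE002093\n"
    "VERSION     AC006837.15  GI:6598619\n"
)

index = {}
# The location tuple (filename, offset, length) is an assumed layout.
primary = index_accessions(index, record, ("example.gb", 0, len(record)))
```

With something like this, a lookup on either AC006837 or AE002093
would land on the same record, and the first accession is still
available as the preferred key.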

> The indexing system uses Robin Dunn's bsddb3 interface on top of
> Sleepycat Berkeley DB package.  You can get them from (respectively)
> 
>   http://pybsddb.sourceforge.net/
>   http://www.sleepycat.com/download.html

Just curious -- why'd you decide to use Berkeley DB?

> The lookup time is very fast.  

Yup, this is *really* nice!

> BUGS/TO DO/THOUGHTS:

> Would working with compressed files be useful?  (Even if slower for
> record retrieval?)

Yes, this would be really useful, at least for me. I always end up
uncompressing and recompressing files before and after I work with
them to keep myself from filling up my hard disk. It would be nice not
to have to go through that cycle every time I switch between projects
I'm working on.
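
For what it's worth, here is a small sketch of how record lookup in a
gzipped file might look, using offsets into the uncompressed stream.
This is only my guess at one approach, not anything mindy does today:
index_gzip_records() and fetch_gzip_record() are made-up names, and
the sketch assumes each record ends with a "//" line as in GenBank.
Seeking in a gzip file means decompressing everything before the
offset, so this is where the slower retrieval would come from.

```python
import gzip
import os
import tempfile


def index_gzip_records(path, terminator=b"//\n"):
    """Return (start, length) pairs, measured in uncompressed bytes."""
    offsets = []
    start = 0
    pos = 0
    with gzip.open(path, "rb") as handle:
        for line in handle:
            pos += len(line)
            if line == terminator:
                offsets.append((start, pos - start))
                start = pos
    return offsets


def fetch_gzip_record(path, start, length):
    """Read one record; the seek decompresses and discards earlier data."""
    with gzip.open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(length)


# Build a tiny two-record example file (made-up contents).
data = b"LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n"
gz_path = os.path.join(tempfile.mkdtemp(), "example.gb.gz")
with gzip.open(gz_path, "wb") as out:
    out.write(data)

offsets = index_gzip_records(gz_path)
second = fetch_gzip_record(gz_path, *offsets[1])
```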

> Would like to be able to add new files to a database.
> 
> Would like to remove/update files in a database.

Yeah, both would be really nice! It seems like there is some support
for this (?) but I didn't play with it.

> Could add a simple query language....
> 
> ..But then more general purposing tools should be used (mySQL?
> PostgreSQL?)

Hmm, would it be hard to support multiple backends? I don't really
know anything about Berkeley DB and just installed it blindly to use
this.


Another addition which I think would be nice is storing the size of
the indexed files. This would let you skip re-indexing when the
indexer is called on a file that has already been indexed. If a
database is already present for a file, the caller compares the stored
size of the file with its current size, and skips a new indexing if
the two match. This is what bioperl does, and I think it's very
useful. Anyways, here's a patch that stores this information. The
GenBank code I wrote uses this to check the size:

$ diff -u mindy_index.py.orig mindy_index.py

--- mindy_index.py.orig	Mon Mar 19 06:15:14 2001
+++ mindy_index.py	Sun Mar 25 10:26:13 2001
@@ -2,7 +2,7 @@
 
 See the usage for more information.
 """
-
+import os
 import sys
 from xml.sax import handler
 from bsddb3 import db, dbshelve
@@ -141,6 +141,7 @@
         self.keywords = keywords
         self.filename = None
         self._filenames = {}
+        self._file_sizes = {}
         self._abbrevs = {}
 
     def add_filename(self, filename):
@@ -153,6 +154,9 @@
         self._abbrevs[filename] = str(abbrev)
 
         self.mindy_data["filenames"] = self._filenames
+
+        self._file_sizes[filename] = os.path.getsize(filename)
+        self.mindy_data["file_sizes"] = self._file_sizes
 
     def use_filename(self, filename):
         if not self._abbrevs.has_key(filename):

Just another thought.
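
The check on the calling side could look roughly like the sketch
below. A plain dict stands in for mindy_data here (in the patch it is
the database-backed object), and needs_indexing() is just a name I
picked for illustration. Note that comparing sizes will miss an edit
that leaves the file the same length.

```python
import os
import tempfile


def needs_indexing(mindy_data, filename):
    """Return True if no size is stored for filename or the size changed."""
    stored = mindy_data.get("file_sizes", {})
    if filename not in stored:
        return True
    return stored[filename] != os.path.getsize(filename)


# Small demonstration with a throwaway file (made-up contents).
with tempfile.NamedTemporaryFile("w", suffix=".gb", delete=False) as tmp:
    tmp.write("LOCUS       AC006837\n//\n")
path = tmp.name

# Pretend the file was indexed and its size recorded, as in the patch.
mindy_data = {"file_sizes": {path: os.path.getsize(path)}}
up_to_date = not needs_indexing(mindy_data, path)

# Appending a record changes the size, so the index is now stale.
with open(path, "a") as handle:
    handle.write("LOCUS       XXXXXXXX\n//\n")
changed = needs_indexing(mindy_data, path)

os.remove(path)
```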

>   I'm back from all my travels so I'll be catching up on
> things (back email, bills, etc.) over the next few days.
> Just thought you all would like to know if I end up sending
> replies to old messages :)

Nice to have you back! BTW, since you are back and I have your
attention (hopefully :-), have you thought about adding Martel to the
CVS tree? I added support for installing it to the setup.py already,
so it should be almost "ready to go" if you are still in favor of
doing this.

Brad