[Biopython] SeqIO.index improvement suggestions

Fri Dec 18 21:39:11 UTC 2009

Hi Renato,

I'm cooking dinner while writing this, so it won't be as in depth as
usual...

On Fri, Dec 18, 2009 at 5:17 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>
> [I tried submitting this message to the dev mailing list, but got
> rejected since I'm not yet authorized to post there, so here it goes]

Have you definitely subscribed to the dev list? That should be all that
is required to post there, and this discussion would be better suited
there.

> Hi everyone,
>
> I'm working on changes to the Bio.SeqIO.index() function to make it more
> consistent with the .read and .parse i.e. accept a filehandle instead of
> a filename and also to include a way to cache the index into a file to
> speed up the process.
>
> The reason why we are implementing these two is because we were going to
> implement our own index solution until we realized this was added to 1.52.
>
> However the implementation in 1.52 has a few limitations.

Yes, this was designed to cover basic use cases in a general way,
but with the option in future to do other things - and in particular
saving the index to disk was kept in mind.

> One limitation is that we are using a gzipped database for the sake of
> space and using gzip.open() to create the file-handle that would then be
> passed to .parse(). The same was not doable with .index().
> This is already implemented in
> http://github.com/Unode/biopython/commit/6fc390151452e3ddf26a117269132125a3ffb3fe

That was a deliberate choice in that the index code wants to "own"
the handle. If other code has access to the handle, there is a risk
of different bits of code moving the handle pointer etc. But, if you
are careful it could be done.
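
To illustrate the risk with a toy example (plain Python, with a
StringIO standing in for the file - this is not the Bio.SeqIO code,
and the record names are made up):

```python
import io

# Stand-in for an open sequence file holding two FASTA records.
handle = io.StringIO(">alpha\nACGT\n>beta\nGGCC\n")

# Toy offset index: record id -> position of its ">" line.
offsets = {}
while True:
    pos = handle.tell()
    line = handle.readline()
    if not line:
        break
    if line.startswith(">"):
        offsets[line[1:].strip()] = pos

# The indexing code starts reading record "alpha"...
handle.seek(offsets["alpha"])
header = handle.readline()

# ...then some other code sharing the handle moves the pointer...
handle.seek(0, io.SEEK_END)

# ...so the next read gets nothing instead of the sequence line.
sequence = handle.readline()
print(repr(header), repr(sequence))
```

If the index always seeks immediately before every read, and nothing
else touches the handle in between, this problem goes away - which is
what I mean by being careful.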

There are also issues here in combination with saving the index.
With a filename, the code can easily reopen the file in the same
mode. With a handle, things are more tricky. You have non-file
handles to consider - such as the gzip example. There is also the
problem of recording the file mode (normal text, universal text,
or binary - which we will need for SFF files - code already written).

If we do change the code to allow handles, it would have to be
to allow handles OR filenames to be compatible with Biopython
1.52 and 1.53 (which take just filenames). This could be handled
as in Bio.SeqIO.convert(), which also allows both (which was the
subject of some discussion!).
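
As a rough sketch of the "filename OR handle" idea (the helper names
open_source and count_records are hypothetical, and this is not how
Bio.SeqIO.convert() is actually written):

```python
import io

def open_source(source, mode="r"):
    """Hypothetical helper: return (handle, we_opened_it)."""
    if isinstance(source, str):
        # A filename: we open it ourselves, so we own the handle and
        # could reopen the file later in the same mode.
        return open(source, mode), True
    # Anything else is assumed to be a handle-like object owned by
    # the caller, so we must not close it.
    return source, False

def count_records(source):
    """Count FASTA records given either a filename or a handle."""
    handle, owned = open_source(source)
    try:
        return sum(1 for line in handle if line.startswith(">"))
    finally:
        if owned:
            handle.close()

print(count_records(io.StringIO(">a\nAC\n>b\nGT\n")))
```

The point of tracking ownership is that only the filename case lets
the code close and reopen the file itself.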

> The second is that we are going to use this feature to quick search the
> database in a web application. Here we have the limitation that we don't
> have persistence across web requests, which means that we would need to
> recalculate the index on every web request.
>
> The details of how we plan to implement this are the following:
>
> cPickle the internal dictionary of offsets and save it on the database
> folder with the same name as the database + .index. The consistency
> check on whether the file has changed will be performed based on name
> and timestamp. By default .index() will search for this file, check the
> timestamp and use the cache if they match, otherwise they will be
> recalculated. The save function will be available like:
>
> >>> d = SeqIO.index(...)
> >>> d.save(filename)
>
> where filename is optional and defaults to "%s.index" % _handle.name
>
> We already have a solution like this implemented with subclasses of
> SeqIO._index, it's just a matter of reworking that and merging it into
> Biopython if you consider it a good addition to the code.
>
> I would like to hear your comments and suggestions on this.
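
For anyone following the thread, the caching scheme described above
could be sketched roughly like this. It is a minimal illustration
under my own assumptions (simple FASTA input, non-empty record ids,
and the hypothetical function names build_offsets and load_offsets),
not the code in either of our branches:

```python
import os
import pickle
import tempfile

def build_offsets(filename):
    """Scan a FASTA file, mapping record id -> byte offset."""
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumes each ">" line carries a non-empty id.
                offsets[line[1:].split()[0].decode()] = pos
    return offsets

def load_offsets(filename, cache=None):
    """Reuse a pickled cache if the name and timestamp still match."""
    cache = cache or "%s.index" % filename
    name = os.path.basename(filename)
    mtime = os.path.getmtime(filename)
    if os.path.exists(cache):
        with open(cache, "rb") as handle:
            saved = pickle.load(handle)
        if saved["name"] == name and saved["mtime"] == mtime:
            return saved["offsets"]
    # No cache, or the file changed: recalculate and save again.
    offsets = build_offsets(filename)
    with open(cache, "wb") as handle:
        pickle.dump({"name": name, "mtime": mtime, "offsets": offsets}, handle)
    return offsets

# Small demonstration using a temporary file.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "db.fasta")
with open(path, "wb") as handle:
    handle.write(b">a\nAC\n>b\nGT\n")
print(load_offsets(path))
```

Bear in mind that unpickling a file someone else could have written
is not safe, which matters for a web application.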

Yes, saving indexes is an obvious addition. I have explored
using pickle via shelve, and also SQLite - there are
implementations of both on my github repository - and have
begun to look into the existing OBF Open Biological
Database Access (OBDA) specification for cross project
compatibility. Another potential benefit here is reduced
memory usage if we don't keep the dictionary of offsets
in RAM.

http://github.com/peterjc/biopython/tree/index-shelve
http://github.com/peterjc/biopython/tree/index-sqlite
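
To give a flavour of the SQLite idea, here is a minimal sketch using
only the standard library. The table layout and the function names
(save_offsets, lookup) are invented for this example and are not
what is on the branch above:

```python
import os
import sqlite3
import tempfile

def save_offsets(db_path, offsets):
    """Store a dict of record id -> file offset in an SQLite file."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(key TEXT PRIMARY KEY, file_offset INTEGER)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
            offsets.items(),
        )
    con.close()

def lookup(db_path, key):
    """Fetch one offset from disk without loading the whole index."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT file_offset FROM offsets WHERE key = ?", (key,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(key)
    return row[0]

# Small demonstration using a temporary database file.
db = os.path.join(tempfile.mkdtemp(), "index.sqlite")
save_offsets(db, {"alpha": 0, "beta": 12})
print(lookup(db, "beta"))
```

Each lookup here goes to disk, so the offsets never need to sit in
a Python dictionary in memory.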

There is a potential complication with index sub-classes
which do more specialised indexing (e.g. GenBank files,
and for a more extreme case, SFF files). See:
http://github.com/peterjc/biopython/tree/sff-seqio

Anyway - great to see you are finding the code useful,
and have some quite similar ideas for how to extend
it further.

Peter