[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Sat Jun 5 11:51:10 UTC 2010

On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote:
> On Jun 4, 2010, at 2:04 PM, Peter wrote:
>>
>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>>
>> Peter
>
> There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics).
> There is a website:
>
> http://www.hdfgroup.org/projects/biohdf/

It looks like they are making good progress - with SAM/BAM conversion to and
from BioHDF in place. Still, as they say:

>>> The current BioHDF distribution is a pipeline prototype designed to show
>>> the suitability of HDF5 as a biological data store and to determine how to
>>> best implement an HDF5-based bioinformatics pipeline. It is in source code
>>> format only. The code builds a set of command-line tools which allow
>>> uploading and extracting DNA/RNA sequence and alignment data from
>>> next-generation gene sequencers. These files have been provided with the
>>> same BSD license used by HDF5
>>>
>>> ...
>>>
>>> Please be aware that the code contained in it will be in a high state of flux
>>> in the immediate future.

This certainly looks like something to keep an eye on.

In any case, getting back to the thread's purpose: Bio.SeqIO.index() aims to
give random access to sequences by their ID for many different file formats.
There has been little interest in extending this to support gzipped files.
However, extending the code to store the id/offset lookup table on disk with
SQLite3 (rather than in memory as a Python dict) would seem welcome. I'll be
refreshing the GitHub branch where I was working on this earlier in the year...
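
To make the id/offset idea concrete, here is a minimal sketch using only the
standard library. It is not the code from the branch and not Biopython's API:
the function names (build_offset_index, get_raw), the table layout, and the
FASTA-only parsing are all assumptions made for illustration.

```python
import sqlite3


def build_offset_index(fasta_path, db_path):
    """Scan a FASTA file once and store each record's ID and byte offset.

    Hypothetical helper for illustration only. Assumes db_path does not
    already hold an 'offsets' table; a duplicate ID raises
    sqlite3.IntegrityError because 'id' is the primary key.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE offsets (id TEXT PRIMARY KEY, file_offset INTEGER)"
    )
    # Binary mode so that tell() gives true byte offsets.
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split(None, 1)
                if not parts:
                    continue  # header line with no ID, skipped in this sketch
                con.execute(
                    "INSERT INTO offsets VALUES (?, ?)",
                    (parts[0].decode(), offset),
                )
    con.commit()
    return con


def get_raw(fasta_path, con, record_id):
    """Return the raw bytes of one record by seeking to its stored offset."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the ">" header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines)
```

The lookup table lives in the SQLite file rather than in a Python dict, so
memory use stays roughly flat however many records there are, and the index
can be reopened later without rescanning the sequence file.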

Peter