[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
Chris Fields
cjfields at illinois.edu
Sat Jun 5 12:56:13 UTC 2010
On Jun 5, 2010, at 6:51 AM, Peter wrote:
> On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote:
>> On Jun 4, 2010, at 2:04 PM, Peter wrote:
>>>
>>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>>>
>>> Peter
>>
>> There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics).
>> There is a website:
>>
>> http://www.hdfgroup.org/projects/biohdf/
>
> It looks like they are making good progress - with SAM/BAM conversion to and
> from BioHDF in place. Still, as they say:
>
>>>> The current BioHDF distribution is a pipeline prototype designed to show
>>>> the suitability of HDF5 as a biological data store and to determine how to
>>>> best implement an HDF5-based bioinformatics pipeline. It is in source code
>>>> format only. The code builds a set of command-line tools which allow
>>>> uploading and extracting DNA/RNA sequence and alignment data from
>>>> next-generation gene sequencers. These files have been provided with the
>>>> same BSD license used by HDF5.
>>>>
>>>> ...
>>>>
>>>> Please be aware that the code contained in it will be in a high state of flux
>>>> in the immediate future.
>
> This certainly looks like something to keep an eye on.
>
> In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to
> give random access to sequences by their ID for many different file formats.
> There has been little interest in extending this to support gzipped files.
> However, extending the code to store the id/offset lookup table on disk with
> SQLite3 (rather than in memory as a Python dict) would seem welcome. I'll be
> refreshing the github branch where I was working on this earlier in the year...
>
> Peter
We have seen (on the BioPerl side) some interest in supporting gzip/bzip and other compressed input via the PerlIO layer, and also in AnyDBM backed by SQLite. Mark Jensen actually did a little work along these lines, though I'm not sure how complete the support is at the moment.
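
To make the id/offset idea quoted above concrete, here is a minimal Python sketch using only the standard library. It is not the Bio.SeqIO.index() implementation, nor the code in Peter's branch: the function names (build_offset_index, fetch_record), the table layout, and the FASTA-only parsing are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def build_offset_index(fasta_path, db_path):
    """Store each FASTA record's starting byte offset in an SQLite table.

    Hypothetical helper: the schema (id -> file_offset) is an assumption,
    not the layout used by any Biopython release.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, file_offset INTEGER)"
    )
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Take the first word after '>' as the record ID.
                fields = line[1:].split(None, 1)
                seq_id = fields[0].decode() if fields else ""
                # A duplicate ID raises sqlite3.IntegrityError here.
                con.execute(
                    "INSERT INTO offsets VALUES (?, ?)", (seq_id, offset)
                )
    con.commit()
    return con


def fetch_record(fasta_path, con, seq_id):
    """Look up the offset on disk, seek to it, and return the raw record."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the '>' header line
        for line in handle:
            if line.startswith(b">"):  # next record begins
                break
            lines.append(line)
    return b"".join(lines).decode()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    fasta = os.path.join(workdir, "example.fasta")
    with open(fasta, "wb") as out:
        out.write(b">seq1 first record\nACGT\nAC\n>seq2\nGGCC\n")
    con = build_offset_index(fasta, os.path.join(workdir, "example.sqlite"))
    print(fetch_record(fasta, con, "seq2"))
    con.close()
```

The lookup table lives in the SQLite file rather than in a Python dict, so memory use stays flat however many records the sequence file holds, and the index can be reused between runs as long as the sequence file is unchanged. Note that this relies on seeking in an uncompressed file; it would not carry over to gzipped input as-is.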
chris