[Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond

Tue Apr 17 15:23:22 UTC 2012

On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Here are some things that I think are strong
> candidates for 1.60 (not an exclusive list!)
>
> ...
>
> BGZF support: Low level module like Python's gzip,
> support in SeqIO for indexing BGZF compressed files,
> ...

I've just rebased my bgzf branch, which I think is ready to apply to the
trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
https://github.com/peterjc/biopython/tree/bgzf2

Would anyone like to review this, please? There are unit tests and
plenty of docstrings, but so far nothing in the Tutorial.

I wrote a blog post late last year explaining what this allows, and
this branch includes the changes to Bio.SeqIO for indexing BGZF
compressed sequence files that the post discussed:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

The probable next step after this is combining it with Andrew Sczesnak's
work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
(who as far as I know hasn't signed up to the biopython-dev list, BCC'd).

Also it would be interesting to explore doing the (de)compression of
blocks on worker threads to take advantage of multiple cores.
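
As a rough illustration of the worker-thread idea, here is a minimal
sketch and not code from the bgzf branch. It assumes a modern Python 3
with concurrent.futures, uses plain zlib streams rather than real BGZF
blocks (no gzip headers or BGZF extra fields), and the names
BLOCK_SIZE, split_blocks, compress_blocks and decompress_blocks are
hypothetical. Whether threads actually help depends on the zlib calls
releasing the interpreter lock on the platform in question.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical payload size per block, chosen to stay under 64 KiB.
BLOCK_SIZE = 65280


def split_blocks(data, size=BLOCK_SIZE):
    """Cut the raw bytes into independent fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_blocks(data, workers=4):
    """Compress each chunk on a worker thread, keeping block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of its inputs.
        return list(pool.map(zlib.compress, split_blocks(data)))


def decompress_blocks(blocks, workers=4):
    """Decompress each block on a worker thread and rejoin the bytes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Because each block is compressed independently, the blocks can be
handed out to workers in any order so long as the results are put
back together in the original order.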

Another idea would be to switch from a plain dictionary to an
ordered dictionary for holding cached decompressed blocks,
giving a way to drop the oldest block (although perhaps not as
clever as dropping the least recently used block, the overhead is
lower). That would require including our own OrderedDict class
on the older Python platforms.
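
A minimal sketch of the oldest-block-first cache, assuming
collections.OrderedDict is available (Python 2.7 or 3.1 onwards). The
class name BlockCache, its methods and the max_blocks limit are
hypothetical, not what the branch currently does:

```python
from collections import OrderedDict


class BlockCache:
    """Hold decompressed blocks keyed by block start offset (sketch)."""

    def __init__(self, max_blocks=100):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, start_offset):
        """Return the cached block, or None if it is not cached."""
        return self._blocks.get(start_offset)

    def put(self, start_offset, data):
        """Cache a block, dropping the oldest inserted one when full."""
        if (start_offset not in self._blocks
                and len(self._blocks) >= self.max_blocks):
            # last=False pops the first inserted entry (FIFO order).
            self._blocks.popitem(last=False)
        self._blocks[start_offset] = data
```

Reads do not reorder the entries here, which is what keeps the
overhead below that of a true least recently used cache.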

Peter

[*] PyPy testing is complicated by running out of file handles,
an existing issue rather than something introduced by this work.
Part of this is down to the different garbage collection under PyPy.