[Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond

Fri Apr 20 22:35:59 UTC 2012

Peter,

My colleague was writing some code using MafIndex and commented on how 
long it took her to download, decompress and index the human multiz 
alignments from UCSC. It seems like it'd be great to keep the files 
compressed... perhaps if the code works well enough we can convince UCSC 
to host bgzip'd copies (or maybe make them available on one of our 
institutions' servers).

Is I.J. interested in joining the community? I'd like to look into 
adding BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If 
not, could you put me in touch?

Andrew

On 04/17/2012 11:23 AM, Peter Cock wrote:
> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> Here are some things that I think are strong
>> candidates for 1.60 (not an exclusive list!)
>>
>> ...
>>
>> BGZF support: Low level module like Python's gzip,
>> support in SeqIO for indexing BGZF compressed files,
>> ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unit tests and
> plenty of docstrings, but so far nothing in the Tutorial.
>
> I wrote a blog post late last year explaining what this allows, and
> this branch includes the changes to Bio.SeqIO to index BGZF
> compressed sequence files that it discusses:
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
>
> The probable next step after this is combining it with Andrew Sczesnak's
> work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
> (who as far as I know hasn't signed up to the biopython-dev list, BCC'd).
>
> Also it would be interesting to explore doing the (de)compression of
> blocks on worker threads to take advantage of multiple cores.
>
> Another idea would be to switch from a plain dictionary to an
> ordered dictionary for holding cached decompressed blocks,
> giving a way to drop the oldest block (although perhaps not as
> clever as dropping the least recently used block, the overhead is
> lower). That would require including our own OrderedDict class
> on the older Python platforms.
>
> Peter
>
> [*] PyPy testing is complicated by running out of file handles,
> an existing issue rather than something caused by this work. Part
> of this is down to the different garbage collection under PyPy.
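
The ordered-dictionary idea in the quoted message (drop the oldest cached
block once the cache is full) could be sketched roughly as below. This is
only an illustration using the standard library, not code from the bgzf
branch: the class name BlockCache, the max_blocks parameter and the
load_block callback are all hypothetical, and the loader stands in for
whatever reads and decompresses a block at a given file offset.

```python
from collections import OrderedDict


class BlockCache(object):
    """Bounded cache of decompressed blocks, evicting the oldest insert.

    Hypothetical sketch: keys are block start offsets, values are the
    decompressed bytes. Eviction is first-in first-out, not least
    recently used, so a cache hit does not reorder anything.
    """

    def __init__(self, max_blocks=100):
        if max_blocks < 1:
            raise ValueError("max_blocks must be at least 1")
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, start_offset, load_block):
        """Return the block at start_offset, calling load_block on a miss."""
        try:
            return self._blocks[start_offset]
        except KeyError:
            pass
        data = load_block(start_offset)
        if len(self._blocks) >= self.max_blocks:
            # last=False pops the entry that was inserted first.
            self._blocks.popitem(last=False)
        self._blocks[start_offset] = data
        return data

    def __len__(self):
        return len(self._blocks)

    def __contains__(self, start_offset):
        return start_offset in self._blocks


if __name__ == "__main__":
    def fake_loader(offset):
        # Stand-in for reading and decompressing a real block.
        return ("block@%d" % offset).encode("ascii")

    cache = BlockCache(max_blocks=2)
    for offset in (0, 100, 0, 200):
        cache.get(offset, fake_loader)
    # Offset 0 was inserted first, so it is the one dropped for 200,
    # even though it was also the most recently requested before that.
    assert 0 not in cache and 100 in cache and 200 in cache
```

A least-recently-used variant would additionally move a key to the end of
the ordering on every hit, which is the extra overhead the message refers
to.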