[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)

Peter Cock p.j.a.cock at googlemail.com
Tue Nov 8 17:52:59 UTC 2011


On Tue, Nov 8, 2011 at 5:40 PM, Kevin Jacobs wrote:
> I've added a proper LRU uncompressed block cache to the samtools tabix code,
> if that would be of any help.  It greatly improves performance for many
> access patterns.  (I didn't look to see if you'd already done that in your
> code.)
> -Kevin

Hi Kevin,

Is this already in the mainline samtools tabix repository?

The current implementation in my Python code just caches the
current block, but a simple pool had occurred to me. How many
blocks to keep (given each is at most 64 KB uncompressed) and how
best to pick that number isn't obvious to me. Perhaps you can
suggest some sensible defaults?

In fact, a proper LRU cache would make sense for the handle
pool in Bio.SeqIO.index_db(...) as well.
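
To make the idea concrete, here is a rough sketch of the kind of
LRU cache I have in mind. It is not the samtools tabix code, nor
what is currently in my BGZF module; the class name LRUCache, the
max_items and on_evict parameters, and the fake_load_block helper
are all made up for illustration. It assumes a Python version where
collections.OrderedDict has move_to_end (3.2 or later).

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical least-recently-used cache with a fixed entry limit."""

    def __init__(self, max_items, on_evict=None):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        # Optional callback(key, value) run when an entry is dropped,
        # e.g. to close a file handle in a handle pool.
        self._on_evict = on_evict
        self._items = OrderedDict()

    def get(self, key, load):
        """Return the cached value for key, calling load(key) on a miss."""
        if key in self._items:
            # Cache hit: mark this entry as the most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        value = load(key)
        self._items[key] = value
        if len(self._items) > self._max_items:
            # Over the limit: drop the least recently used entry.
            old_key, old_value = self._items.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(old_key, old_value)
        return value

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


# Stand-in for "seek to this block start offset and decompress it".
loads = []


def fake_load_block(start_offset):
    loads.append(start_offset)
    return b"x" * 10


evicted = []
cache = LRUCache(2, on_evict=lambda key, value: evicted.append(key))
cache.get(0, fake_load_block)    # miss, loads block at offset 0
cache.get(100, fake_load_block)  # miss, loads block at offset 100
cache.get(0, fake_load_block)    # hit, offset 0 becomes most recent
cache.get(200, fake_load_block)  # miss, evicts offset 100
```

For the block cache the key would be the block's start offset in
the compressed file and the value its decompressed bytes, so
max_items times 64 KB bounds the memory used. For a handle pool the
key would be the filename and on_evict would close the handle.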

Regards,

Peter



