[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)

Peter Cock p.j.a.cock at googlemail.com
Tue Nov 8 18:28:04 UTC 2011


On Tue, Nov 8, 2011 at 6:11 PM, Kevin Jacobs wrote:
> On Tue, Nov 8, 2011 at 12:52 PM, Peter Cock wrote:
>> On Tue, Nov 8, 2011 at 5:40 PM, Kevin Jacobs wrote:
>> > I've added a proper LRU uncompressed block cache to the
>> > samtools tabix code, if that would be of any help. It greatl
>> > improves performance for many access patterns.
>> > (I didn't look to see if you'd already done that in your
>> > code.)
>> > -Kevin
>>
>> Hi Kevin,
>>
>> Is this already in the mainline samtools tabix repository?
>>
>> The current implementation in my Python code just caches the
>> current block - but a simple pool had occurred to me. How many
>> blocks (given each is 64kb) and how best to pick that number
>> isn't obvious to me. Perhaps you can suggest some sensible
>> defaults?
>>
>> In fact, a proper LRU cache would make sense for the handle
>> pool in Bio.SeqIO.index_db(...) as well.
>>
>
> Hi Peter,
>
> There is a random-eviction cache implemented in the mainline that is okay,
> but it is turned off by default and, if enabled, can be very inefficient if
> it keeps evicting your most active blocks.  Converting the cache it to LRU
> was very simple and I've been using it locally for some time now, but I
> haven't had time to send the changes on to Heng Li.

Are your changes on github or somewhere public? Heng Li has the
core samtools bit of the samtools SVN on github, which he seems
to use for experimental new code: https://github.com/lh3/samtools

> I choose the size of the cache based on the application and access patterns.
>  For roughly sequential sequence queries (a la samtools faidx or Pysam
> Fastafile), all one needs is a handful of active blocks (say 16).  When
> repeated querying tabix files via pysam, I typically use 128 blocks for the
> best trade-off between memory and performance.  Choosing a cache size for
> BAM files is much more complicated and I have a wide-range of setting
> depending on how many parallel BAM streams and access patterns are employed.
> The cache size numbers needed to be quite a bit larger before switching to
> LRU (which was a bit surprising).  However, using even a small cache is
> vastly beneficial for many access patterns.   The cost of re-reading a block
> from disk can be mitigated by the OS filesystem cache, but the decompression
> step takes non-trivial CPU time and can be triggered dozens of hundreds of
> times per block for some sensible-seeming access patterns.
> -Kevin

Certainly useful food for thought - thank you. I agree that the OS
will probably cache commonly used BGZF blocks in the filesystem
cache, but it doesn't solve the CPU overhead of decompression.

In the case of Bio.SeqIO.index(...) which accesses one file, and
Bio.SeqIO.index_db(...) which may access several files, we currently
don't offer any end user options like this. However, there is an internal
option for the max number of handles, and a similar option could
control the number of BGZF blocks to cache. I could try 100
blocks (100 times 64kb is about 6MB) as the default, and redo
the UniProt timings (random access to sequences).

That might be a good compromise, given the SeqIO indexing code
has no easy way to know the calling code's usage patterns.

As I said on the blog post, we should be able to improve the
speed of the BGZF random access - this idea alone could
make a big difference, although probably a naive block cache
(rather than LRU) would be a worthwhile step in itself.

Regards,

Peter




More information about the Biopython-dev mailing list