[BioRuby] [GSoC] Weekly report #3

Mon Jun 4 20:07:03 UTC 2012

> Have you got any good random access benchmarks to try this out
> with? i.e. something non-random, such as pulling mates of paired
> end reads.
>

Currently, no. Please suggest some benchmarks: I suspect that you have much
more experience with BAM files and know the typical usage patterns better
than I do.

> How many BGZF blocks are you keeping in the cache, and why?
>

Currently, 512. There is no particular reason; it just seems like a
reasonable number (about 30 MB of RAM). Maybe it should be a runtime
parameter, but I doubt that end users will bother tweaking the cache size.
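
For illustration, a block cache with a configurable capacity could look like the sketch below. This is only a minimal sketch under assumptions: the class name `BlockCache`, keying blocks by file offset, and the least-recently-used eviction policy are all hypothetical and are not taken from the actual implementation.

```ruby
# Hypothetical LRU cache for decompressed BGZF blocks, keyed by file offset.
# The eviction policy (least recently used) is an assumption.
class BlockCache
  attr_reader :capacity

  def initialize(capacity = 512)
    @capacity = capacity
    @blocks = {} # Ruby hashes preserve insertion order, oldest entry first
  end

  # Returns the cached block for +offset+, or calls the given block to load it.
  def fetch(offset)
    if @blocks.key?(offset)
      data = @blocks.delete(offset) # removed here, re-inserted below as newest
    else
      data = yield(offset)
      @blocks.shift if @blocks.size >= @capacity # evict least recently used
    end
    @blocks[offset] = data
  end

  def size
    @blocks.size
  end
end
```

With decompressed blocks of at most 64 KB each, 512 cached blocks take at most 512 * 65536 bytes = 32 MiB, which matches the estimate above.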

> Are you thinking about BGZF output yet (which will be required
> in order to write BAM files)?
>

It's not hard at all. I have already written code that packs a string into
a BGZF block in Ruby:
https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb
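
To give an idea of what packing involves, here is a minimal sketch written against Ruby's standard `zlib` library. It is not the code from the link above; the method name `pack_bgzf_block` and the constant `MAX_BLOCK_SIZE` are hypothetical. A BGZF block is a regular gzip member whose header carries a 'BC' extra subfield holding the total block size minus one.

```ruby
require 'zlib'

# Maximum total size of one BGZF block (header + compressed data + footer).
MAX_BLOCK_SIZE = 65_536

# Packs +data+ into a single BGZF block (hypothetical helper, for illustration).
def pack_bgzf_block(data, level = Zlib::DEFAULT_COMPRESSION)
  # Negative window bits produce a raw deflate stream without a zlib header.
  deflater = Zlib::Deflate.new(level, -Zlib::MAX_WBITS)
  cdata = deflater.deflate(data, Zlib::FINISH)
  deflater.close

  total = cdata.bytesize + 26 # 18-byte header + 8-byte footer
  raise ArgumentError, 'block too large' if total > MAX_BLOCK_SIZE

  header = [0x1f, 0x8b, # gzip magic
            8,          # CM: deflate
            4,          # FLG: FEXTRA is set
            0,          # MTIME
            0,          # XFL
            0xff,       # OS: unknown
            6,          # XLEN: length of the extra field
            66, 67,     # subfield identifier 'B', 'C'
            2,          # SLEN: length of the subfield payload
            total - 1]  # BSIZE: total block size minus one
           .pack('CCCCVCCvCCvv')
  footer = [Zlib.crc32(data), data.bytesize].pack('VV') # CRC32 and ISIZE
  header + cdata + footer
end
```

Since the result is a valid gzip member, it can be checked with any gzip reader, for example `Zlib::GzipReader`.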

Parallelizing should also be easy, since it's very similar to reading blocks
from a file: determine how many alignments to pack into one block (it's
64 KB max), send the compression task to the task pool, then go on to create
the next chunk of alignments, and so on.
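
The scheme above can be sketched roughly as follows, using plain Ruby threads and a queue as the task pool. All names here (`chunk_records`, `compress_parallel`, `MAX_PAYLOAD`) and the payload limit are assumptions made for illustration, and `Zlib::Deflate.deflate` is only a stand-in for a real BGZF packer. For simplicity the sketch chunks all records up front rather than overlapping chunking with compression, and it does not split a single record larger than the limit.

```ruby
require 'zlib'

# Hypothetical limit on uncompressed bytes per block, kept below the 64 KB cap.
MAX_PAYLOAD = 65_280

# Groups serialized records into chunks whose total size stays within max_bytes.
# A record larger than max_bytes ends up alone in an oversized chunk.
def chunk_records(records, max_bytes = MAX_PAYLOAD)
  chunks = [[]]
  size = 0
  records.each do |rec|
    if size + rec.bytesize > max_bytes && !chunks.last.empty?
      chunks << []
      size = 0
    end
    chunks.last << rec
    size += rec.bytesize
  end
  chunks.reject(&:empty?)
end

# Compresses chunks on a fixed pool of worker threads.
# Results are returned in the same order as the input chunks.
def compress_parallel(chunks, workers = 4)
  jobs = Queue.new
  chunks.each_with_index { |chunk, i| jobs << [i, chunk.join] }
  workers.times { jobs << nil } # one stop marker per worker

  results = Array.new(chunks.size)
  threads = Array.new(workers) do
    Thread.new do
      while (job = jobs.pop)
        index, payload = job
        results[index] = Zlib::Deflate.deflate(payload) # stand-in packer
      end
    end
  end
  threads.each(&:join)
  results
end
```

Writing each result to its own slot in `results` keeps the output in input order regardless of which worker finishes first, which matters because the blocks must be written to the file in order.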