[GSoC] Weekly report #3
Peter Cock
p.j.a.cock at googlemail.com
Mon Jun 4 19:36:25 UTC 2012
On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov <lomereiter at googlemail.com> wrote:
> Hello all,
>
> the post is here:
> http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/
>
> I've implemented random access to BAM files, using the index file. Also I
> created a generic function for memoization which stores decompressed
> blocks in a cache, following some desired cache strategy. Currently, I
> use a simple FIFO cache.
That sounds good. We've talked a little bit about the block caching
strategy for Biopython's BGZF support. Dropping the least recently
used (LRU) block would be good, but it adds the overhead of recording
a timestamp on every access.
Currently my Biopython BGZF code just drops a cached block 'at
random' (actually based on the dictionary hashing algorithm), and
switching to FIFO was something I planned to try next (easily done
with Python's OrderedDict class). FIFO seems like a good solution
as the overheads are much lower than LRU.
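
To make the FIFO idea concrete, here is a minimal sketch using
OrderedDict. It is not the actual Biopython BGZF code: the names
FIFOBlockCache, load_block and max_blocks are hypothetical, and
load_block stands in for whatever reads and decompresses the block
at a given file offset. The default of 100 blocks is arbitrary.

```python
from collections import OrderedDict


class FIFOBlockCache:
    """Cache decompressed blocks, evicting the oldest inserted one first."""

    def __init__(self, load_block, max_blocks=100):
        if max_blocks < 1:
            raise ValueError("max_blocks must be at least 1")
        # Hypothetical callable: block start offset -> decompressed bytes.
        self._load_block = load_block
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # remembers insertion order
        self.hits = 0
        self.misses = 0

    def get(self, offset):
        """Return the block starting at offset, loading it on a miss."""
        if offset in self._blocks:
            # A hit does not reorder anything; that is what makes this
            # FIFO rather than LRU, and why there is no per-access cost
            # beyond the dictionary lookup.
            self.hits += 1
            return self._blocks[offset]
        self.misses += 1
        data = self._load_block(offset)
        if len(self._blocks) >= self._max_blocks:
            # last=False pops the entry that was inserted first.
            self._blocks.popitem(last=False)
        self._blocks[offset] = data
        return data
```

The hits and misses counters are only there so that different cache
sizes can be compared on the same access pattern.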
Have you got any good random access benchmarks to try this out
with? i.e. something with a realistic rather than uniformly random
access pattern, such as pulling the mates of paired end reads.
How many BGZF blocks are you keeping in the cache, and why?
Are you thinking about BGZF output yet (which will be required
in order to write BAM files)?
Regards,
Peter