[GSoC] Weekly report #3

Artem Tarasov lomereiter at googlemail.com
Mon Jun 4 18:02:58 UTC 2012


Hello all,

the post is here:
http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/

I've implemented random access to BAM files using the index file. I also
wrote a generic memoization function that stores decompressed blocks in a
cache, following a chosen caching strategy. Currently, I use a simple
FIFO cache.
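
To make the caching idea concrete, here is a minimal sketch of a memoizer with FIFO eviction. It is written in C++ purely for illustration and is not the library's D code; the class name `FifoMemoizer`, its `get` method, and the idea of keying blocks by an integer offset are all assumptions.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: wraps an expensive function (for example, block
// decompression keyed by file offset) with a bounded FIFO cache.
template <typename Key, typename Value>
class FifoMemoizer {
public:
    // A capacity of zero is treated as one so that get() can always
    // return a reference to a stored value.
    FifoMemoizer(std::function<Value(const Key&)> compute, std::size_t capacity)
        : compute_(std::move(compute)),
          capacity_(capacity == 0 ? 1 : capacity) {}

    // Returns the cached value if present; otherwise computes and stores it,
    // evicting the oldest inserted entry when the cache is full.
    // The returned reference stays valid only until that entry is evicted.
    const Value& get(const Key& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;  // hit: FIFO does not reorder on access
        }
        Value value = compute_(key);  // miss: do the expensive work once
        if (cache_.size() >= capacity_) {
            cache_.erase(order_.front());  // drop the oldest inserted key
            order_.pop_front();
        }
        order_.push_back(key);
        return cache_.emplace(key, std::move(value)).first->second;
    }

private:
    std::function<Value(const Key&)> compute_;
    std::size_t capacity_;
    std::unordered_map<Key, Value> cache_;
    std::deque<Key> order_;  // keys in insertion order, oldest at the front
};
```

Because the eviction policy lives entirely in `get`, switching to another strategy (LRU, for instance) would only change how `order_` is maintained, which is the point of keeping the memoization function generic.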

I also studied how to make SAM output faster. I came to the conclusion that
not only the D standard library functions, but even those of the *printf
family are too slow for this purpose, because they have to parse the format
string on every call. Instead, I need to use specialized functions for
printing integers and floats. Currently, output is about 4x slower than in
samtools. So I have to take back some of my harsh words about its code and
say that there is something to learn from it. It indeed uses its own
functions for integer output, and also uses a string buffer to make fewer
calls (system functions can't be inlined). I'll use this approach, too, so
very soon my library will be usable in pipelines, but only for output.
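
The approach described above (a dedicated integer formatter plus a string buffer) can be sketched roughly as follows. This is an illustrative C++ version, not code from samtools or from the library; `append_int`, `flush_if_large`, and the 64 KiB threshold are hypothetical names and values.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Appends the decimal form of `value` to `out` without parsing any
// format string. Digits are produced from least to most significant
// into a small stack buffer and then copied in one step.
inline void append_int(std::string& out, std::int64_t value) {
    char digits[20];  // a 64-bit magnitude has at most 20 decimal digits
    char* const end = digits + sizeof(digits);
    char* p = end;

    // Work on the unsigned magnitude so the most negative value is safe.
    std::uint64_t mag = value < 0
        ? 0 - static_cast<std::uint64_t>(value)
        : static_cast<std::uint64_t>(value);

    do {
        *--p = static_cast<char>('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);

    if (value < 0) {
        out.push_back('-');
    }
    out.append(p, static_cast<std::size_t>(end - p));
}

// Writes the buffer to `file` only once it has grown past `threshold`
// bytes (an assumed value), so many fields share a single fwrite call.
inline void flush_if_large(std::string& buf, std::FILE* file,
                           std::size_t threshold = 64 * 1024) {
    if (buf.size() >= threshold) {
        std::fwrite(buf.data(), 1, buf.size(), file);
        buf.clear();
    }
}
```

A SAM record would then be assembled by appending fields and tab characters to one buffer and calling `flush_if_large` after each record, with a final unconditional write at the end of the output.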

Then I'm going to move on to allowing alignments to be modified and written
to BAM. After that, a SAM parser needs to be implemented, and I'm going to
use Ragel (a finite-state machine compiler) for that purpose. So by the
beginning of July I want to have SAM<->BAM conversion working at a good
speed. Add to that the first release of the biogem, and those are my plans
for this month.


