[Biojava-l] .sff support

Peter biopython at maubp.freeserve.co.uk
Fri Feb 26 13:33:19 UTC 2010


On Thu, Feb 25, 2010 at 10:08 AM, Charles Imbusch <charles at imbusch.net> wrote:
>
> Dear Peter,
>
> thanks for your mail. I will try to make use of that index
> to speed things up when I have time available.
>
> Cheers,
>  Charles

Hi Charles,

I found that when you want random access to the reads, loading the
provided .mft or .srt index is MUCH faster than scanning the whole
file to build the index manually. So this really is worth the effort.

I hope the comments in my code are reasonably clear, but to
recap: the key idea of the index block is that you get chunks of
data of varying length (although typically all the same length, since
by default all the Roche read names have the same length) like this:

name, null char, four character offset, terminator char of 0xFF

You divide the index block into entries for each read by
finding the 0xFF terminators. Because 0xFF (decimal 255)
is used in this way, it cannot be used to encode the offsets
which must only use 0x00 to 0xFE (decimal 0 to 254). The
offset therefore uses base 255 instead of base 256.
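
To make that concrete, here is a minimal Python sketch of the decoding
step. It is not the Biopython code itself: the function names are
made up for illustration, the example block is invented, and it
assumes the four offset bytes are stored most significant first
(please check this against a real file).

```python
def decode_offset(digits):
    """Decode four base-255 digits into an integer file offset.

    Assumption (not stated in the email): most significant digit first.
    """
    if len(digits) != 4:
        raise ValueError("expected exactly four offset bytes")
    offset = 0
    for digit in digits:  # iterating over bytes yields ints in Python 3
        if digit == 0xFF:
            raise ValueError("0xFF is the terminator, not a valid digit")
        offset = offset * 255 + digit
    return offset


def parse_index_block(block):
    """Return a dict mapping read name (bytes) to decoded offset."""
    offsets = {}
    # Whatever follows the final 0xFF (nothing, or padding) is not an entry.
    for entry in block.split(b"\xff")[:-1]:
        # Layout: name, null char, four offset bytes (terminator removed).
        if len(entry) < 6 or entry[-5] != 0:
            raise ValueError("malformed index entry: %r" % entry)
        offsets[entry[:-5]] = decode_offset(entry[-4:])
    return offsets


# Invented two-entry index block, for illustration only.
example = b"READ1\x00\x00\x00\x00\x10\xff" + b"READ2\x00\x00\x00\x01\x00\xff"
print(parse_index_block(example))
```

Splitting on 0xFF is safe precisely because that byte can never
appear inside an offset.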

Note that this means that the largest offset the current
Roche index blocks can hold is 255^4 - 1, or a little under 4GB.
If you use the Roche tools to try to merge SFF files to
make an example SFF file over 4GB, you get a warning
that there will be no index (and no manifest).
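
As a quick check of that limit, the arithmetic can be done in a few
lines of Python. This is just a sketch of the numbers, with
MAX_OFFSET being a name I have picked for illustration:

```python
# Four base-255 digits, each in the range 0..254.
MAX_OFFSET = 255**4 - 1

print(MAX_OFFSET)          # largest offset the index can encode
print(MAX_OFFSET / 2**30)  # the same value expressed in GiB
print(2**32 - 255**4)      # offsets lost relative to four base-256 bytes
```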

The index holds the reads sorted alphabetically by name.
We don't take advantage of this in Biopython since I use
a Python dictionary (like a Perl hash) to store the offsets.
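
If you did want to exploit the sort order, for instance to avoid
building a hash, a binary search over the names would work. Here is
a rough sketch using Python's bisect module. The function and
variable names are hypothetical, and it assumes the names are sorted
in plain byte order, which I have not verified against the Roche
tools:

```python
import bisect


def find_offset(sorted_names, offsets, name):
    """Look up a read's offset by binary search.

    sorted_names and offsets are parallel lists, with sorted_names
    in ascending order (assumed to match the order in the index).
    """
    i = bisect.bisect_left(sorted_names, name)
    if i < len(sorted_names) and sorted_names[i] == name:
        return offsets[i]
    raise KeyError(name)


# Invented data, for illustration only.
names = [b"READ_A", b"READ_C", b"READ_E"]
starts = [840, 1696, 2544]
print(find_offset(names, starts, b"READ_C"))
```

In practice a dictionary is simpler and gives constant time lookups,
so the binary search only pays off if you want to avoid holding a
hash of every read name.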

In case you missed them, I'd like to draw your attention
to the SFF files I am using in the Biopython unit tests:
http://github.com/biopython/biopython/tree/master/Tests/Roche/

Regards,

Peter



