[Bioperl-l] Announcing Bio::SFF

Peter Cock p.j.a.cock at googlemail.com
Mon Dec 19 14:31:18 UTC 2011


On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the read names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.
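
As a rough illustration of those steps, here is a minimal Python sketch
(not the Bio::SFF or Biopython code). It assumes the common header layout
from the SFF specification, and the function name load_raw_index is made
up for this example. It only fetches the raw index block; decoding the
name/offset table is left out, because that layout is vendor specific.

```python
import io
import struct

# Fixed-size part of the SFF common header (big-endian, 31 bytes):
# magic, 4 version bytes, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code.
COMMON_HEADER = struct.Struct(">4s4BQIIHHHB")


def load_raw_index(handle):
    """Return (number_of_reads, raw_index_bytes) for a binary SFF handle.

    Hypothetical helper for illustration only.
    """
    handle.seek(0)
    data = handle.read(COMMON_HEADER.size)
    if len(data) != COMMON_HEADER.size:
        raise ValueError("file too short for an SFF common header")
    fields = COMMON_HEADER.unpack(data)
    magic, version = fields[0], fields[1:5]
    index_offset, index_length, number_of_reads = fields[5:8]
    if magic != b".sff" or version != (0, 0, 0, 1):
        raise ValueError("not an SFF version 1 file")
    if index_offset == 0 or index_length == 0:
        # No index block: the caller has to scan every read record instead.
        return number_of_reads, b""
    # One seek and one read pull the whole index into memory.
    handle.seek(index_offset)
    raw = handle.read(index_length)
    if len(raw) != index_length:
        raise ValueError("truncated index block")
    return number_of_reads, raw
```

The first four bytes of the returned block are the index magic number
(.mft or .srt for the two Roche indexes mentioned above), so a reader
can check them before trying to parse the table.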

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003. I did restructure my code into two
> readers though, since doing sequential and random access in one class
> didn't make much sense code-wise.
>
>> I don't think anyone used the hash table style indexes (.hsh), which
>> I assume was a proof of principle or trial in the early days of SFF.
>
> I see, too bad.
>
>> One thing to check is what Ion Torrent's SFF files use. I would
>> guess they've followed Roche, but I don't know. After all, the
>> index structure is not defined in the SFF specification - it was
>> left extensible on purpose.
>
> Yeah, we should check that too.

I don't have any Ion Torrent data first hand, and the public
samples I've seen were FASTQ not SFF. But I know a few
people with Ion Torrent machines that might be able to help...

> It's added to 0.003. The lack of tests was bothering me, but the
> SFFs I had at hand were not suitable.

Have you looked at the sample SFF data in Biopython? Please
use them for the BioPerl unit tests (we've been talking about a
cross-project collection of test data files like this). The README
file should be self-explanatory:
https://github.com/biopython/biopython/tree/master/Tests/Roche

Peter


