[Biopython-dev] sff reader

Peter biopython at maubp.freeserve.co.uk
Thu Aug 13 17:33:41 UTC 2009


[Jose - you didn't CC the list with your reply]

On Wed, Aug 12, 2009 at 6:52 PM, Blanca Postigo Jose
Miguel <jblanca at btc.upv.es> wrote:
>
> Hi:
>
> I just love free software :) It's great to watch how the code is being improved
> by the work of so many people. I hope to get some time to get a look at the
> latest sff reader.

You'll probably be interested to know I've made some excellent progress
with the (optional) SFF index block. I note that the specifications (both
on the NCBI page and in the Roche manual) appear to suggest that the
index block could appear in the middle of the read data. However,
in all the examples I have looked at, the index is actually at the end.

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff
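
For anyone wanting to check where the index sits in their own files, here
is a minimal sketch that reads the fixed part of the common header and
reports the index offset and length. It assumes the big-endian field
layout described on the NCBI page above; the function name
read_sff_index_location is just something made up for illustration and is
not part of Biopython.

```python
import struct

# Fixed-size part of the SFF common header (big-endian), as described on
# the NCBI format page: magic number, version, index_offset (uint64),
# index_length (uint32), number_of_reads (uint32), header_length (uint16),
# key_length (uint16), number_of_flows_per_read (uint16),
# flowgram_format_code (uint8).
_COMMON_HEADER_FMT = ">4s4sQIIHHHB"


def read_sff_index_location(handle):
    """Return (index_offset, index_length, number_of_reads).

    Hypothetical helper for illustration only. The handle must be opened
    in binary mode and positioned at the start of the file.
    """
    size = struct.calcsize(_COMMON_HEADER_FMT)
    data = handle.read(size)
    if len(data) < size:
        raise ValueError("Too short to contain an SFF common header")
    (magic, _version, index_offset, index_length, number_of_reads,
     _header_length, _key_length, _flows_per_read,
     _flowgram_code) = struct.unpack(_COMMON_HEADER_FMT, data)
    if magic != b".sff":
        raise ValueError("Missing .sff magic number")
    # Both index fields are zero when the file has no index block.
    return index_offset, index_length, number_of_reads
```

If index_offset plus index_length equals the file size, the index is at
the end of the file, which is what I have seen in every example so far.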

Sadly the format of the index isn't documented, but I think I have
reverse engineered the format that Roche SFF files are using. In a
slight twist on the specification, they are actually using the index block
for both XML metadata AND an index of the read offsets.

This will dovetail nicely with the indexing support in Bio.SeqIO
which I am working on for Biopython 1.52 (on a github branch).
I expect to have fast random access to reads in an SFF file
very soon. See http://github.com/peterjc/biopython/tree/convert
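
To illustrate why an index of read offsets gives fast random access, here
is a rough sketch. It assumes we already have a dict mapping read names to
byte offsets (however that was obtained, since the Roche index layout is
undocumented), and it assumes the read header layout given in the NCBI
description. The names read_header_at and fetch_number_of_bases are
hypothetical, and this is not the code on my branch.

```python
import struct

# Fixed part of an SFF read header (big-endian), per the NCBI format page:
# read_header_length, name_length (uint16 each), number_of_bases (uint32),
# then clip_qual_left, clip_qual_right, clip_adapter_left and
# clip_adapter_right (uint16 each). The read name follows these fields.
_READ_HEADER_FMT = ">HHIHHHH"


def read_header_at(handle, offset):
    """Seek straight to a read and return (name, number_of_bases)."""
    handle.seek(offset)
    fixed = handle.read(struct.calcsize(_READ_HEADER_FMT))
    (_header_length, name_length, number_of_bases,
     _clip_qual_left, _clip_qual_right,
     _clip_adapter_left, _clip_adapter_right) = struct.unpack(
        _READ_HEADER_FMT, fixed)
    name = handle.read(name_length).decode("ascii")
    return name, number_of_bases


def fetch_number_of_bases(handle, offsets, read_name):
    """Look up one read by name using a name -> offset dict (assumed)."""
    name, number_of_bases = read_header_at(handle, offsets[read_name])
    if name != read_name:
        raise ValueError("Offset for %r points at read %r" % (read_name, name))
    return number_of_bases
```

The point is that only one seek and a small read are needed per lookup,
rather than scanning every read from the start of the file.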

>> > It looks like you (James) construct Seq objects using the full
>> > untrimmed sequence as is. I was undecided on if trimmed or
>> > untrimmed should be the default, but the idea of some kind of
>> > masked or trimmed Seq object had come up on the mailing list
>> > which might be useful here (and in contig alignments). i.e.
>> > something which acts like a Seq object giving the trimmed
>> > sequence, but which also contains the full sequence and trim
>> > positions.
>>
>> I'm still thinking about this. One simplistic option (as used on
>> my branch) would be to have two input formats in Bio.SeqIO,
>> one untrimmed and one trimmed, e.g. "sff" and "sff-trim".
>
> I think that some way to mask the SeqRecord or Seq object
> would be great. It would be useful for many tasks, not just this
> one.

Sure - if we can come up with a suitable design...
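
As a starting point for discussion, here is a minimal sketch of the kind
of object described above: it behaves like the trimmed sequence but keeps
the full sequence and the trim positions. The class name TrimmedSequence
and its attributes are made up for illustration, it wraps a plain string
rather than a Seq, and I am assuming 0-based half-open trim coordinates
(the SFF clip values themselves are 1-based).

```python
class TrimmedSequence:
    """Hypothetical wrapper: acts as the trimmed sequence, keeps the rest."""

    def __init__(self, full, clip_left, clip_right):
        # clip_left and clip_right are 0-based, half-open (Python slicing).
        if not 0 <= clip_left <= clip_right <= len(full):
            raise ValueError("Trim positions outside the sequence")
        self.full = full
        self.clip_left = clip_left
        self.clip_right = clip_right

    def __str__(self):
        # Only the trimmed region is exposed by default.
        return self.full[self.clip_left:self.clip_right]

    def __len__(self):
        return self.clip_right - self.clip_left

    def __getitem__(self, index):
        # Indexing and slicing are relative to the trimmed region.
        return str(self)[index]
```

For example, TrimmedSequence("tcagACGTACGTnnnn", 4, 12) would print as
ACGTACGT while still giving access to the untrimmed string via .full.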

Peter


