[Biopython-dev] sff reader

Peter biopython at maubp.freeserve.co.uk
Wed Aug 12 12:54:15 UTC 2009


On Thu, Jul 23, 2009 at 10:34 AM, Peter wrote:
> On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote:
>> I don't think there is much in it really.  You have a factored
>> BinaryFile class, I have classes for the components of the SFF file.
>> Both are based around struct.

I have now written a third variant (loosely based on Jose's code).
This is just a single generator function (also based on struct).
Right now it is a slightly long function, but it can be refactored
easily enough. It is also a lot faster than Jose's code, which is a
big plus point for large files. See:
http://github.com/peterjc/biopython/tree/sff

I haven't compared my new code against yours for speed yet,
James, because your parser didn't like my large SFF file. You
have hard-coded it to expect read names of length 14, and
400 flows per read. I have some data from Sanger where the
read names are length 14, but there are 800 flows per read.
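
To illustrate how the flow count can be taken from the file rather than
hard-coded, here is a minimal sketch of reading the SFF common header with
struct. It is not the code on my branch or James' parser; the function name
read_sff_header and the demo bytes are made up for illustration, and the
field layout is my reading of the SFF format description (big-endian,
31 fixed bytes, header padded to a multiple of 8).

```python
import io
import struct

# Fixed-size part of the SFF common header (big-endian, 31 bytes):
# magic, 4 version bytes, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code.
HEADER_FMT = ">4s4BQIIHHHB"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def read_sff_header(handle):
    """Return (number_of_reads, flows_per_read, flow_chars, key_sequence).

    Hypothetical helper: uses only handle.read(), no seek/tell.
    """
    data = handle.read(HEADER_SIZE)
    if len(data) < HEADER_SIZE:
        raise ValueError("Truncated SFF header")
    (magic, _v0, _v1, _v2, _v3, _index_offset, _index_length,
     number_of_reads, header_length, key_length, flows_per_read,
     _flowgram_format) = struct.unpack(HEADER_FMT, data)
    if magic != b".sff":
        raise ValueError("Not an SFF file (bad magic number)")
    flow_chars = handle.read(flows_per_read).decode("ascii")
    key_sequence = handle.read(key_length).decode("ascii")
    # Skip the padding by reading and discarding it.
    handle.read(header_length - HEADER_SIZE - flows_per_read - key_length)
    return number_of_reads, flows_per_read, flow_chars, key_sequence


# Synthetic header for a file with 2 reads and 800 flows per read.
# 31 + 800 + 4 = 835 bytes, padded to 840.
demo_header = (
    struct.pack(HEADER_FMT, b".sff", 0, 0, 0, 1, 0, 0, 2, 840, 4, 800, 1)
    + b"TACG" * 200
    + b"TCAG"
    + b"\0" * 5
)
demo_result = read_sff_header(io.BytesIO(demo_header))
```

The per-read flowgram size then follows from flows_per_read instead of a
constant such as 400.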

Having the two reference parsers to look at was educational,
so thank you both (James and Jose) for sharing your code.

I now understand the SFF file format much better, and am now
confident I could design an indexer to provide dictionary-like
access to it - a possible addition to Bio.SeqIO - see this thread:
http://lists.open-bio.org/pipermail/biopython/2009-June/005312.html
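
As a rough idea of what dictionary-like access means here, the sketch below
builds a name-to-offset index in one pass and then seeks to a record on
demand. It uses a made-up toy record layout (1-byte name length, name,
2-byte payload length, payload), not the real SFF read layout, and the
names build_index and fetch are hypothetical.

```python
import io
import struct


def build_index(handle):
    """Scan toy records once and map each record name to its byte offset."""
    index = {}
    offset = 0
    while True:
        head = handle.read(1)
        if not head:
            return index
        name_len = head[0]
        name = handle.read(name_len).decode("ascii")
        (payload_len,) = struct.unpack(">H", handle.read(2))
        handle.read(payload_len)  # skip the payload while indexing
        index[name] = offset
        # Track the offset by counting bytes, so indexing needs no tell().
        offset += 1 + name_len + 2 + payload_len


def fetch(handle, index, name):
    """Jump to a record by name and return its payload (needs seek)."""
    handle.seek(index[name])
    name_len = handle.read(1)[0]
    handle.seek(name_len, 1)  # skip over the name
    (payload_len,) = struct.unpack(">H", handle.read(2))
    return handle.read(payload_len)


def make_record(name, payload):
    """Encode one toy record."""
    return (bytes([len(name)]) + name.encode("ascii")
            + struct.pack(">H", len(payload)) + payload)


demo_file = io.BytesIO(make_record("read1", b"ACGT") + make_record("read2", b"TTGACA"))
demo_index = build_index(demo_file)
demo_payload = fetch(demo_file, demo_index, "read2")
```

Only the random access step needs a seekable handle; building the index is
still a single forward pass.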

> Jose's code uses seek/tell which means it has to have a handle
> to an actual file. He also used binary read mode - I'm not sure if
> this was essential or not.

Binary mode was not essential - opening an SFF file in default
mode also seemed to work fine with Jose's code.

> James' code seems to make a single pass through the file handle,
> without using seek/tell to jump about. I think this is nicer, as it is
> consistent with the other SeqIO parsers, and should work on
> more types of handles (e.g. from gzip, StringIO, or even a
> network connection).

I've also avoided using seek/tell in my rewrite.
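
For anyone wondering what this looks like in practice, here is a small
sketch of a single-pass generator that skips padding by reading and
discarding bytes rather than seeking, which is why it also works on a gzip
handle. The record layout (2-byte length prefix, payload, padded to a
multiple of 8 bytes) is a toy one chosen for illustration, not the SFF read
layout, and iter_records and read_exactly are hypothetical names.

```python
import gzip
import io
import struct


def read_exactly(handle, size):
    """Read size bytes using only handle.read(), or fail on truncation."""
    data = handle.read(size)
    if len(data) != size:
        raise ValueError("Premature end of file")
    return data


def iter_records(handle):
    """Yield payloads from a stream of padded, length-prefixed records."""
    while True:
        prefix = handle.read(2)
        if not prefix:
            return  # clean end of stream
        if len(prefix) < 2:
            raise ValueError("Premature end of file")
        (length,) = struct.unpack(">H", prefix)
        payload = read_exactly(handle, length)
        # Skip padding up to the next multiple of 8 by reading it,
        # so the handle never needs to support seek/tell.
        read_exactly(handle, -(2 + length) % 8)
        yield payload


def make_record(payload):
    """Encode one toy record, padded with NUL bytes to a multiple of 8."""
    body = struct.pack(">H", len(payload)) + payload
    return body + b"\0" * (-len(body) % 8)


raw = make_record(b"abc") + make_record(b"defghi")
compressed = gzip.compress(raw)
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz_handle:
    demo_records = list(iter_records(gz_handle))
```

A handle that can return short reads (for example a raw socket) would need
read_exactly to loop until it has enough bytes; that is left out here.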

> It looks like you (James) construct Seq objects using the full
> untrimmed sequence as is. I was undecided on if trimmed or
> untrimmed should be the default, but the idea of some kind of
> masked or trimmed Seq object had come up on the mailing list
> which might be useful here (and in contig alignments). i.e.
> something which acts like a Seq object giving the trimmed
> sequence, but which also contains the full sequence and trim
> positions.

I'm still thinking about this. One simplistic option (as used on
my branch) would be to have two input formats in Bio.SeqIO,
one untrimmed and one trimmed, e.g. "sff" and "sff-trim".
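
To make the difference between the two concrete, here is a minimal sketch of
the trimming step. It assumes clip positions are 1-based and inclusive, with
0 meaning "not set", which is my reading of the SFF format description. The
helper name trim_read is hypothetical, and this is not necessarily how the
branch implements "sff-trim".

```python
def trim_read(sequence, clip_left, clip_right):
    """Return the part of sequence between the two clip positions.

    Assumes 1-based inclusive clip positions, where 0 means unset.
    """
    start = max(clip_left, 1) - 1  # convert to a 0-based slice start
    end = clip_right if clip_right > 0 else len(sequence)
    end = min(end, len(sequence))
    return sequence[start:end]


full_read = "tcagACGTACGTnn"
untrimmed = trim_read(full_read, 0, 0)   # what "sff" would give
trimmed = trim_read(full_read, 5, 12)    # what "sff-trim" would give
```

With this split, the untrimmed format keeps every base as stored in the
file, while the trimmed one drops the key and any low quality tail.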

Peter



