[Biopython-dev] Sequential SFF IO

Peter Cock p.j.a.cock at googlemail.com
Wed Jan 26 17:19:36 UTC 2011

On Wed, Jan 26, 2011 at 4:44 PM, Kevin Jacobs <jacobs at bioinformed.com> wrote:
> On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>> On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote:
>> > Any objections/worries about converting the SFF writer to use the
>> > sequential/incremental writer object interface?  I know it looks
>> > specialized for text formats, but
>> It already uses Bio.SeqIO.Interfaces.SequenceWriter
> Sorry-- was shooting from the hip.  I meant a SequentialSequenceWriter.

The file formats which use SequentialSequenceWriter have trivial
(or no) header/footer, which require no additional arguments. The
SFF file format has a non-trivial header which records flow space
settings etc. Any write_header method would have to be SFF specific,
likewise any write_footer method for the index and XML manifest.
I don't see what you have in mind.

In fact, looking at SffIO.py again now, I think the SffWriter's
write_header and write_record methods should be private, with
just write_file as the public method.

>> > ... I need to split large SFF files into many smaller ones
>> > and would rather not materialize the whole thing.  The SFF writer
>> > code already allows for deferred writing of read counts and index
>> > creation, so it looks to be only minor surgery.
>> I don't understand what problem you are having with the SeqIO API.
>> It should be quite happy to take a generator function, iterator, etc
>> (as opposed to a list of SeqRecord objects which I assume is what
>> you mean by "materialize the whole thing").
> The goal is to demultiplex a larger file, so I need a "push" interface.
>  e.g.
> out = dict(...) # of SffWriters
> for rec in SeqIO.parse(filename, 'sff-trim'):
>   out[rec.id].write_record(rec)
> for writer in out.itervalues():
>   writer.write_footer()

I don't think the above will work without some "magic" to record the
SFF header (which currently would require using private attributes
of the SffWriter objects) as done via its write_file method.

Also you can't read in SFF files with "sff-trim" if you want to output
them, since this discards all the flow space information. You have
to use format "sff" instead.

> I could use a simple generator if I was merely filtering records, but the
> write_file interface would require more co-routine functionality than
> generators provide.

How many output files do you have? Assuming the number is small, I'd
go for the simple solution of one loop over the input SFF file for
each output file.
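To illustrate the one-pass-per-output-file idea, here is a minimal,
self-contained sketch. The parse() stand-in and the wanted_ids mapping
are hypothetical placeholders so it runs anywhere; with real data each
pass would be Bio.SeqIO.parse(filename, "sff") and the filtered records
would be fed to Bio.SeqIO.write(records, out_name, "sff"):

```python
# Stand-in for the reads stored in the original SFF file.
records_on_disk = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "TTAA")]

def parse():
    # Re-reads the "file" from the start, as each pass over a real
    # SFF file on disk would.
    return iter(records_on_disk)

# Hypothetical mapping of output filename -> read IDs wanted in it.
wanted_ids = {"out1.sff": {"r1", "r3"}, "out2.sff": {"r2"}}

outputs = {}
for out_name, wanted in wanted_ids.items():  # one full pass per output
    outputs[out_name] = [rec for rec in parse() if rec[0] in wanted]
```

The cost is one scan of the input per output file, but nothing beyond
the current record is held in memory at any point.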

A variation on this would be to make a list of read IDs for each
output file, then use Bio.SeqIO.index for random access to the
records, e.g.

records = SeqIO.index(original_filename, "sff")
for filename in [...]:
    wanted = [...] # some list or generator
    wanted_records = (records[name] for name in wanted)
    SeqIO.write(wanted_records, filename, "sff")

Otherwise look at itertools.tee for splitting the iterator if you really
want to make a single pass through the original SFF file.
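For what it's worth, a small self-contained illustration of the
itertools.tee approach, using integers as stand-in records. One caveat:
tee buffers items until every copy has consumed them, so draining the
copies one after another can end up holding much of the stream in
memory.

```python
import itertools

# tee() returns independent iterators over the same underlying stream;
# each consumer then filters out just the records it wants.
stream = iter(range(10))  # stand-in for one pass over the SFF reads
first, second = itertools.tee(stream, 2)
evens = [x for x in first if x % 2 == 0]   # "reads" for one output file
odds = [x for x in second if x % 2 == 1]   # "reads" for the other
```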

>> > There doesn't seem to be an obvious API for obtaining such a
>> > writer using the SeqIO interface.
>> You can do that with:
>> from Bio.SeqIO.SffIO import SffWriter
> For my immediate need, this is fine.  However, the more general
> API doesn't have a SeqIO.writer to get SequentialSequenceWriter
> objects.

For good reason - not all the writers use SequentialSequenceWriter,
because for many file formats it is too narrow in scope.

