[Biopython-dev] Sequential SFF IO
p.j.a.cock at googlemail.com
Wed Jan 26 17:19:36 UTC 2011
On Wed, Jan 26, 2011 at 4:44 PM, Kevin Jacobs <bioinformed at gmail.com> wrote:
> On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock <p.j.a.cock at googlemail.com>
>> On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote:
>> > Any objections/worries about converting the SFF writer to use the
>> > sequential/incremental writer object interface? I know it looks
>> > specialized for text formats, but
>> It already uses Bio.SeqIO.Interfaces.SequenceWriter
> Sorry-- was shooting from the hip. I meant a SequentialSequenceWriter.
The file formats which use SequentialSequenceWriter have trivial
(or no) header/footer, which require no additional arguments. The
SFF file format has a non-trivial header which records flow space
settings etc. Any write_header method would have to be SFF specific,
likewise any write_footer method for the index and XML manifest.
I don't see what you have in mind.
In fact, looking at SffIO.py again now, I think the SffWriter's
write_header and write_record methods should be private, with
just write_file as the public method.
>> > ... I need to split large SFF files into many smaller ones
>> > and would rather not materialize the whole thing. The SFF writer
>> > code already allows for deferred writing of read counts and index
>> > creation, so it looks to be only minor surgery.
>> I don't understand what problem you are having with the SeqIO API.
>> It should be quite happy to take a generator function, iterator, etc
>> (as opposed to a list of SeqRecord objects which I assume is what
>> you mean by "materialize the whole thing").
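To illustrate the lazy-iterator pattern being described here, a minimal
sketch (plain strings stand in for SeqRecord objects, and the filter
threshold is made up for the example):

```python
# A filtering generator is consumed one record at a time, so the whole
# input is never held in memory as a list.
def long_reads(records, min_len=5):
    """Yield records at least min_len long, lazily."""
    for rec in records:
        if len(rec) >= min_len:
            yield rec

reads = ["ACGT", "ACGTACGT", "ACGTACGTACGT"]
kept = list(long_reads(reads))
# In real code the generator would be passed straight to the writer, e.g.
#     SeqIO.write(long_reads(SeqIO.parse(in_name, "sff")), out_name, "sff")
```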
> The goal is to demultiplex a larger file, so I need a "push" interface.
> out = dict(...) # of SffWriters
> for rec in SeqIO.parse(filename, 'sff-trim'):
> for writer in out.itervalues():
I don't think the above will work without some "magic" to record the
SFF header (which currently would require using private attributes
of the SffWriter objects) as done via its write_file method.
Also you can't read in SFF files with "sff-trim" if you want to output
them, since this discards all the flow space information. You have
to use format "sff" instead.
> I could use a simple generator if I was merely filtering records, but the
> write_file interface would require more co-routine functionality than
> generators provide.
How many output files do you have? Assuming the number is small,
I'd go for the simple solution of one loop over the input SFF file
per output file.
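A sketch of that "one pass per output file" approach, with plain tuples
standing in for SeqRecords and a hypothetical key function classifying
each read (neither is part of Biopython):

```python
# Re-read the input once per output group: simple, low memory, but the
# input is traversed len(barcodes) times.
def demultiplex_simple(read_input, barcodes, key):
    """Return {barcode: [records]} by re-reading the input per barcode."""
    out = {}
    for bc in barcodes:
        # In real code each group would be streamed to
        # SeqIO.write(..., some_filename, "sff") via a generator
        # instead of collecting it into a list.
        out[bc] = [rec for rec in read_input() if key(rec) == bc]
    return out

reads = [("r1", "A"), ("r2", "B"), ("r3", "A")]
groups = demultiplex_simple(lambda: iter(reads), ["A", "B"],
                            key=lambda r: r[1])
```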
A variation on this would be to make a list of read IDs for each
output file, then use Bio.SeqIO.index for random access to the
records, e.g.
records = SeqIO.index(original_filename, "sff")
for filename in [...]:
    wanted = [...] # some list or generator of read IDs
    subset = (records[id] for id in wanted)
    SeqIO.write(subset, filename, "sff")
Otherwise look at itertools.tee for splitting the iterator if you really
want to make a single pass through the original SFF file.
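For reference, a small example of how itertools.tee shares one pass over
an iterator between two consumers (plain integers stand in for parsed
records):

```python
import itertools

# Each tee'd copy buffers items the other branch has not consumed yet,
# so the underlying source is only traversed once.
source = iter(range(10))          # stands in for a parsed record stream
left, right = itertools.tee(source)
evens = [n for n in left if n % 2 == 0]
odds = [n for n in right if n % 2 == 1]
```

One caveat: because tee buffers unconsumed items, letting one branch run
far ahead of the other (as the list comprehensions above do) can end up
holding much of the stream in memory anyway.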
>> > There doesn't seem to be an obvious API for obtaining such a
>> > writer using the SeqIO interface.
>> You can do that with:
>> from Bio.SeqIO.SffIO import SffWriter
> For my immediate need, this is fine. However, the more general
> API doesn't have a SeqIO.writer to get SequentialSequenceWriter
For good reason - not all the writers use SequentialSequenceWriter,
because for many file formats it is too narrow in scope.