[Biopython-dev] Preparing for Biopython 1.50 (beta)

Tue Mar 17 14:46:53 UTC 2009

2009/3/16 David Schruth <dschruth at u.washington.edu>:
> I've got some 454 and Solid data you could test it on too.
>
> Has anybody else looked into how these other two Next Gen formats might
> complicate things?

Roche 454 sequencers produce their own binary SFF files (Standard
Flowgram Format), but Roche provides tools which turn these into
standard Sanger style files using PHRED qualities.  In theory, we
might be able to parse the SFF files directly, see for example
http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/
and the links given.  In practice, most sequencing centers using Roche
454 will be happy to provide FASTQ or FASTA+QUAL files, and the code
on Bug 2767 (or the associated experimental branch on github) should
work fine on these.
http://bugzilla.open-bio.org/show_bug.cgi?id=2767
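
To illustrate what "Sanger style files using PHRED qualities" means in
practice, here is a minimal standard-library sketch.  It is not the
code from Bug 2767; the function name parse_sanger_fastq is made up
for this example, and it assumes the simple four-lines-per-record
layout with no line wrapping.  The one fixed fact it relies on is that
Sanger FASTQ stores each PHRED score as a single character with an
ASCII offset of 33.

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores PHRED score q as chr(q + 33)


def parse_sanger_fastq(handle):
    """Yield (name, sequence, phred_scores) tuples from FASTQ lines.

    Illustrative only: assumes exactly four lines per record
    (title, sequence, "+" line, quality) with no line wrapping.
    """
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 4:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record %d" % (i // 4 + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Decode each quality character back to its integer PHRED score.
        scores = [ord(char) - SANGER_OFFSET for char in qual]
        yield title[1:], seq, scores


# A made-up record, not real 454 data.
example = ["@read1\n", "ACGT\n", "+\n", "!5I~\n"]
for name, seq, scores in parse_sanger_fastq(example):
    print(name, seq, scores)
```

Any iterable of lines will do as input, so an open file handle works
the same way as the list used here.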

You are free to try out the proposed code yourself now, but if you
have some particular 454 files you'd like me to check, please email me
(off the mailing list).  If you can share some real data which we
could include in Biopython for a unit test that would also be great
(but unless you tell me this explicitly, I'll only make sure we can
parse your files).

Regarding SOLiD files, they work in colour space and I am under the
impression that it doesn't make sense to convert them to sequence
space until after doing the assembly or genome mapping (in colour
space).  See for example
http://solidsoftwaretools.com/gf/project/mapreads/
In other words, it may not be appropriate to parse SOLiD reads into
Biopython SeqRecord objects, in which case a SOLiD parser wouldn't
belong in Bio.SeqIO.  That isn't to say we wouldn't want a parser
elsewhere in Biopython; perhaps Bio.Sequencing would be the best
place.
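
A rough sketch of why early conversion is risky, under some
assumptions: I am taking a colour space read to be a known primer base
followed by one colour digit (0 to 3) per base transition, with each
colour being the XOR of the two-bit codes A=0, C=1, G=2, T=3.  The
function names are made up for this example and are not part of
Biopython.  Since each base is decoded from the previous one, a single
miscalled colour changes every base after it.

```python
BASES = "ACGT"  # two-bit codes: A=0, C=1, G=2, T=3


def encode_colours(primer_base, sequence):
    """Return the primer base followed by one colour digit per base."""
    previous = BASES.index(primer_base)
    colours = []
    for base in sequence:
        current = BASES.index(base)
        # The colour depends on the pair of adjacent bases only.
        colours.append(str(previous ^ current))
        previous = current
    return primer_base + "".join(colours)


def decode_colours(read):
    """Naively translate a colour space read back to sequence space."""
    previous = BASES.index(read[0])
    decoded = []
    for colour in read[1:]:
        # Each base is derived from the base before it, so errors chain.
        previous ^= int(colour)
        decoded.append(BASES[previous])
    return "".join(decoded)


good_read = encode_colours("T", "ACGGTA")
bad_read = "T323013"  # same read with the second colour miscalled

print(good_read)                  # T313013
print(decode_colours(good_read))  # ACGGTA
print(decode_colours(bad_read))   # AGCCAT
```

The one wrong colour turns ACGGTA into AGCCAT, i.e. every base from
the error onwards is different, whereas a mapper working in colour
space would only see a single mismatch.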

Peter