[Biopython-dev] sff reader

Fri May 22 22:54:32 UTC 2009

Peter and Jose;
I haven't used SFF files myself as we don't have a 454 machine, but
I do know of a couple of implementations that go from SFF to FASTQ/FASTA.
Flower is a Haskell implementation:

http://blog.malde.org/index.php/flower/

And PyroBayes is a 454 base caller:

http://bioinformatics.bc.edu/marthlab/PyroBayes

Depending on what you all end up doing, these might be useful as
comparison points, or for wrapping with Application command lines.

Brad

> On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> >> Hi Peter:
> >> Here you have some code to read the sff files.
> >
> > Thanks - I'm not sure when I'll get to look at this, maybe next week.
> >
> >> For the time being it creates a dict for the sequences. I'm not sure about
> >> how to integrate the generated data in BioPython. The sequence and
> >> qualities should go to a SeqRecord, but there is also the information
> >> about the clipping.
> >
> > For Bio.SeqIO, we would need to use a SeqRecord.  Ideally we'd want to
> > be able to read and write SFF files, and to do that we'll have to record all
> > the essential annotation (i.e. clipping) somehow.
> 
> I've had a look at your code this evening, and written a rough SeqIO
> module using it, available here on enhancement Bug 2837,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2837
> 
> > Can you write SFF files?
> >
> >> For my work I use a kind of SeqRecord with a mask property and the
> >> mask is a Location that shows which part of the sequence is ok. I don't
> >> know if that's a valid model for BioPython.
> >
> > A mask could be done as a list of booleans, and we can treat it as
> > another per-letter-annotation in the SeqRecord.  I'm not sure if this
> > is helpful or not.
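
To make the boolean-mask idea concrete, here is a minimal sketch that
builds a per-letter mask from a pair of clip positions. The function
name mask_from_clip and the 0-based, end-exclusive start/end convention
are assumptions made for this example, not anything taken from Jose's
code. A list like this has one entry per base, which is the shape a
per-letter annotation on a SeqRecord would need.

```python
def mask_from_clip(length, start, end):
    """Return a list of booleans, True for positions inside [start, end)."""
    if not 0 <= start <= end <= length:
        raise ValueError("clip positions out of range")
    return [start <= i < end for i in range(length)]


# Toy read: lower case marks the part that should be masked out.
seq = "tcagACGTACGTnn"
mask = mask_from_clip(len(seq), 4, 12)

# Keep only the bases where the mask is True.
good = "".join(base for base, keep in zip(seq, mask) if keep)
```

The mask costs one boolean per base, whereas a start/end pair carries
the same information for a single contiguous region, so the list only
pays off if non-contiguous masks are needed.
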
> >
> > The Roche tools let you choose to extract trimmed reads as FASTA
> > and QUAL, or untrimmed.  Perhaps for reading SFF files with
> > Bio.SeqIO we should get the user to choose between these
> > options (e.g. format names "roche-sff" and "roche-sff-notrim")?
> 
> This would work...
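
One way the format-name switch could look, as a sketch only. The two
names are the ones proposed above; mapping them to a trim flag, the
helper name select_read, and the 0-based end-exclusive clip positions
are all assumptions for illustration and not existing Bio.SeqIO
behaviour.

```python
# Proposed format names from the thread; the mapping to a flag is assumed.
TRIM_BY_FORMAT = {"roche-sff": True, "roche-sff-notrim": False}


def select_read(seq, quals, start, end, format_name):
    """Return (seq, quals), cut down to [start, end) if the format trims."""
    if format_name not in TRIM_BY_FORMAT:
        raise ValueError("Unknown format %r" % format_name)
    if TRIM_BY_FORMAT[format_name]:
        return seq[start:end], quals[start:end]
    return seq, quals


trimmed = select_read("tcagACGTnn", list(range(10)), 4, 8, "roche-sff")
untrimmed = select_read("tcagACGTnn", list(range(10)), 4, 8, "roche-sff-notrim")
```
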
> 
> > Roche's FASTA files use upper case for the trimmed region, and
> > lower case for the start/end which would get trimmed off. This is
> > simple and we could do this for Biopython too - meaning you'd get
> > the same data if you read the SFF file directly, or used Roche's
> > FASTA+QUAL files with SeqIO.  Note that when reading an SFF
> > file directly, we should probably record the real trim data as well.
> 
> In my current code, I decided to use the same quality trimming
> representation that Roche use if converting the SFF file into FASTA
> format (the leading and trailing trim regions are in lower case). We
> may want to record the trim positions in the SeqRecord's annotation
> as well.
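
A small sketch of the case convention described above: upper case for
the region that is kept, lower case for the leading and trailing parts
that would be trimmed. The helper names and the "clip_start" and
"clip_end" annotation keys are hypothetical, and the positions are
assumed to be 0-based and end-exclusive.

```python
def apply_case_trim(seq, start, end):
    """Upper-case seq[start:end] and lower-case the flanking regions."""
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


def trim_from_case(seq):
    """Recover (start, end) from a mixed-case read; (0, 0) if nothing is kept."""
    upper = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper:
        return 0, 0
    return upper[0], upper[-1] + 1


masked = apply_case_trim("TCAGACGTACGTNN", 4, 12)

# Hypothetical annotation keys, stored next to the sequence.
annotations = {"clip_start": 4, "clip_end": 12}
```

The case alone is enough to recover a single combined trim region, but
it cannot say whether a base was clipped for quality or for adapter,
which is one reason to keep the real trim values in the annotation too.
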
> 
> >> There are also a couple more tricks with the clipping.
> >> In theory there's clip_qual and clip_adapter, but in the files
> >> we've seen clip_adapter is always zero and clip_qual is used
> >> instead for both quality and adapter. I think we could generate
> >> one clipping combining both. Let me know what you think.
> >> Also take into account that in some cases the clipping
> >> generated by the 454 software is just wrong.
> >
> > I'll need to learn more about the details before coming to any
> > conclusions about how to deal with this information in Biopython.
> 
> Right now I have not looked at the left/right adaptor clipping information;
> as you found, in the example file I have looked at, these fields are zero.
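
On combining the two clips into one, here is a sketch of the rule as I
understand it from the SFF format description: the clip values are
1-based, a zero means "not set", the left edge is the larger of the two
left values, and the right edge is the smaller of the two right values.
The function name combined_clip and the choice to return a 0-based
Python slice are assumptions for this example.

```python
def combined_clip(num_bases, clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right):
    """Merge quality and adapter clips into one 0-based (start, end) slice."""
    # Left values: zero means unset, so max() with 1 ignores it.
    first = max(1, clip_qual_left, clip_adapter_left)
    # Right values: zero means unset, so fall back to the read length.
    last = min(clip_qual_right or num_bases, clip_adapter_right or num_bases)
    # Convert 1-based inclusive positions to a Python slice.
    return first - 1, last


seq = "tcagACGTACGTnn"
start, end = combined_clip(len(seq), 5, 12, 0, 0)
kept = seq[start:end]
```

With the adapter fields at zero, as in the files seen so far, this
reduces to the quality clip alone. If the 454 clipping is wrong for a
read, start can end up past end, in which case the slice is simply
empty.
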
> 
> Note I will be away for the next week, so am unlikely to respond to
> any emails on this.
> 
> Peter