[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?
Peter
biopython at maubp.freeserve.co.uk
Sat Feb 21 19:03:14 UTC 2009
On Sat, Feb 21, 2009 at 12:24 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
>
> Hi all,
>
> I am sort of living in this world right now, doing a lot of
> metagenomics, so here are my $0.02. I agree with Leighton (assuming I
> understand him): We should consider the possible applications people
> will run using the quality data when designing the [parser?]
Sure. By having the FASTQ and QUAL files integrated into Bio.SeqIO
(using SeqRecord objects) one simple use case is supported -
interconverting these files into other formats (e.g. FASTQ to FASTA,
or with a little more effort FASTA+QUAL to FASTQ). Your trimming
example is another good use case - which could be done with the
SeqRecord representation.
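
To make the trimming and conversion use case concrete, here is a rough
sketch in plain Python. It deliberately does not use any Biopython API:
the function names (trim_low_quality_tail, to_fasta), the threshold of
20 and the 60 column line width are all assumptions for illustration,
and the quality scores are assumed to be already decoded into a list of
integers.

```python
def trim_low_quality_tail(seq, quals, min_qual=20):
    """Drop trailing bases whose quality score is below min_qual.

    seq is a plain string and quals a list of integer scores of the
    same length (hypothetical representation, not a SeqRecord).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk back from the 3' end until a base passes the threshold.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


def to_fasta(title, seq, width=60):
    """Format one record as FASTA text, wrapping the sequence lines."""
    lines = [">" + title]
    for start in range(0, len(seq), width):
        lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Trim a read, then write it out without its qualities (FASTQ to FASTA).
trimmed_seq, trimmed_quals = trim_low_quality_tail(
    "ACGTACGT", [30, 32, 31, 28, 25, 12, 8, 5]
)
fasta_text = to_fasta("read1", trimmed_seq)
```

The same logic would apply to a SeqRecord carrying per-letter
qualities; only the way the sequence and scores are looked up would
change.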
For anything more complicated (like assembly or mapping onto a
genome), with massive datasets the modest overhead of the SeqRecord
and Seq objects could be an issue - but isn't this sort of thing
usually best handled by an external tool (written in C or C++ by a
specialist)?
Anyway - if you have a look at the first attachment on Bug 2767, I did
the core of the FASTQ parser as a generic function returning a tuple
of strings (the record title, sequence and the encoded quality string
- see FastqGeneralIterator). While this could be just a private
function, I was thinking this could actually be very helpful for
anyone trying to do something where speed or memory usage
was important. On top of this core parser, I had a FastqPhredIterator
(and would similarly have a FastqSolexaIterator) function which turns
these into SeqRecord objects for use via the Bio.SeqIO API. i.e. We
can offer both the standard Bio.SeqIO interface using SeqRecords, and
a simpler string based parser for those that need it.
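
As an illustration of the two layer idea (not the code attached to the
bug), a minimal sketch might look like the following. It assumes strict
four line FASTQ records with no line wrapping, and an ASCII offset of
33 for the PHRED encoding. The names fastq_tuples and phred_records are
made up for this example, and the second layer yields plain tuples
where the real thing would build SeqRecord objects.

```python
import io


def fastq_tuples(handle):
    """Yield (title, sequence, quality_string) tuples of plain strings.

    Assumes each record is exactly four lines: @title, sequence,
    a line starting with +, and the encoded quality string.
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip()
        plus_line = handle.readline()
        qual = handle.readline().rstrip()
        if not title_line.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield title_line[1:].rstrip(), seq, qual


def phred_records(handle, offset=33):
    """Second layer: decode the quality string into integer scores.

    The offset of 33 is an assumption; a Solexa style variant would
    need a different offset and score conversion.
    """
    for title, seq, qual in fastq_tuples(handle):
        yield title, seq, [ord(char) - offset for char in qual]


example = "@read1\nACGT\n+\nII5!\n@read2\nGG\n+read2\n!!\n"
raw = list(fastq_tuples(io.StringIO(example)))
decoded = list(phred_records(io.StringIO(example)))
```

Anyone who only needs the strings can stop at the first function and
skip the cost of decoding the scores or building objects.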
Peter