[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Fri Feb 27 11:13:45 UTC 2009

On Sat, Feb 21, 2009 at 12:24 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
>
> Hi all,
>
> I am sort of living in this world right now, doing a lot of
> metagenomics, so here are my $0.02. I agree with Leighton (assuming I
> understand him): We should consider the possible applications people
> will run using the quality data when designing the
>
> From what I have seen, the most common use for quality scores is for
> trimming the sequences, i.e. removing the lesser quality sequence data
> (usually on the edges) from the 5' and 3' ends of the read. So any data
> structure should take into consideration that we will probably have
> a .trim(self,threshold) method or function trim(seq, threshold) that
> will return a slice of the sequence.

I'm not convinced the SeqRecord needs a trim method (and if it did,
it would also need to take an argument saying which
per-letter-annotation it should use, e.g. the PHRED qualities).  But yes,
this is an excellent example of where it would be very useful to have
the SeqRecord support slicing which also slices the quality
information (as recently discussed, with an implementation on Bug
2507).
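
To make the slicing idea concrete, here is a minimal sketch in plain
Python. It is not Biopython code and not the implementation on the
bug: the AnnotatedRecord class, the trim() helper and the
"phred_quality" key are all hypothetical names chosen for
illustration. It only assumes that per-letter annotations live in a
dictionary of sequences the same length as the record, and that
slicing the record slices each of them too.

```python
class AnnotatedRecord:
    """Hypothetical stand-in for a record with per-letter annotations."""

    def __init__(self, seq, letter_annotations=None):
        self.seq = seq
        self.letter_annotations = dict(letter_annotations or {})
        # Every annotation must have exactly one entry per letter.
        for key, values in self.letter_annotations.items():
            if len(values) != len(seq):
                raise ValueError("annotation %r has the wrong length" % key)

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            return self.seq[index]  # a single letter, no annotations
        # Apply the same slice to the sequence and to every annotation.
        sliced = {key: values[index]
                  for key, values in self.letter_annotations.items()}
        return AnnotatedRecord(self.seq[index], sliced)


def trim(record, threshold, key="phred_quality"):
    """Drop letters scoring below threshold from both ends of the record.

    The key argument says which per-letter annotation to trim on.
    """
    scores = record.letter_annotations[key]
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return record[start:end]


read = AnnotatedRecord("ACGTTGCA",
                       {"phred_quality": [5, 12, 30, 35, 40, 33, 9, 4]})
trimmed = trim(read, 20)
print(trimmed.seq, trimmed.letter_annotations["phred_quality"])
# prints: GTTG [30, 35, 40, 33]
```

With slicing in place, trim() is just a few lines that work out the
start and end points, which is part of why a dedicated method on the
record itself may not be needed.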

I've got a related example use-case, trimming primer sequences from
the raw reads (and trimming the quality scores to match) before
assembly.   If the quality scores are recorded in a
per-letter-annotation dictionary which is integrated into SeqRecord
slicing, this becomes fairly straightforward.  First read in the data
(most simply from a FASTQ file). You look at the SeqRecord's seq to
determine where to cut the sequence, and then apply the slice to the
SeqRecord - this will give you a new SeqRecord with the appropriate
sub-sequence and the appropriate sub-list of the quality scores.  You
can then save this data, either as a FASTQ file, or paired FASTA and
QUAL files.
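
The primer trimming workflow above could look roughly like this. It
is a sketch in plain Python rather than Biopython, so every helper
here (parse_fastq, format_fastq, trim_primer) and the PRIMER value
are hypothetical. It also assumes simple four-line FASTQ records with
quality characters encoded at an ASCII offset of 33, and only handles
a primer at the very start of the read.

```python
import io


def parse_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from four-line FASTQ records.

    Assumes no line wrapping; stops at the first empty line or EOF.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, [ord(char) - offset for char in qual]


def format_fastq(name, seq, quals, offset=33):
    """Return one four-line FASTQ record as a string."""
    qual_str = "".join(chr(q + offset) for q in quals)
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual_str)


def trim_primer(seq, quals, primer):
    """Remove a leading primer, cutting the qualities to match."""
    # Decide where to cut by looking at the sequence only...
    if seq.startswith(primer):
        cut = slice(len(primer), None)
    else:
        cut = slice(None)
    # ...then apply the same slice to sequence and quality scores.
    return seq[cut], quals[cut]


PRIMER = "GATC"  # hypothetical primer sequence
raw = io.StringIO("@read1\nGATCACGTAC\n+\nIIIIFFFF##\n")
output = io.StringIO()
for name, seq, quals in parse_fastq(raw):
    new_seq, new_quals = trim_primer(seq, quals, PRIMER)
    output.write(format_fastq(name, new_seq, new_quals))
print(output.getvalue())
```

If slicing were built into the record, the last line of trim_primer
would collapse to a single slice of the record instead of two
parallel slices that have to be kept in step by hand.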

Peter