[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Fri Feb 20 11:15:57 UTC 2009

Over on enhancement Bug 2767, I have uploaded parsers and writers for
the FASTQ and QUAL file format, which both hold PHRED style quality
scores (integers ranging from 0 to about 90).  See
http://bugzilla.open-bio.org/show_bug.cgi?id=2767

One open question in this enhancement is how to store these PHRED
quality scores in the SeqRecord.  Keep in mind that there is more than
one type of quality score in use, for example Solexa/Illumina use a
different scaling (although it is possible to map between them without
too much trouble for the mid range scores), something I hadn't noticed
when we last talked abut this (Sept 2008). See:
http://lists.open-bio.org/pipermail/biopython-dev/2008-September/004250.html

For the initial code on Bug 2767, I took the simple and extensible
route of recording the PHRED qualities as a list of integers in the
SeqRecord's annotation dictionary under the key "phred_quality".

There are a couple of drawbacks.  Firstly, sequencing qualities are a
good example of per-letter-annotation (others include secondary
structure, atomic coordinates - which would apply to proteins as well
as nucleotides).  If we want to be able to slice a SeqRecord (Bug
2507) then it is important to distinguish between general annotation
(like the source species) and per-letter-annotation (which should also
be sliced).  One way of dealing with this is to introduce a
per-letter-annotation dictionary for the SeqRecord, whose entries
would be strings/lists with a length equal to that of the sequence.

Secondly, putting the PHRED qualities inside an annotations dictionary
(or even a per-letter-annotation dictionary) doesn't make them very
accessible.  If you are wanting to work with sequencing reads, then
the sequence, quality and identifier are all key properties.  In bug
2767 comment #5 Jose wrote:

> Regarding where to store the quality information in the SeqRecord I'm in
> favor of using a property named .qual or .quality. That is consistent with
> the actual .seq property. I think this approach is cleaner, and it is very
> easy to implement and to understand.

I can certainly appreciate that a top level property is easier to use
- and perhaps quality scores are important enough to justify this.
However, what about PHRED qualities versus Solexa/Illumina qualities,
or another sequencing system's scheme?  I hadn't thought about this
incompatibility when we were discussion this on the mailing list last
year (Sept 2008). I suppose you could consider adding a .phred_quality
property which is explicit, but then you'd end up with many different
properties.  Then there are other per-letter quality annotations - you
might want the A, C, G and T intensity from capillary sequencing (four
sets of numbers, not just one).  Plus of course this doesn't address
non-quality related per-letter-annotations (like secondary structure,
or atomic coordinates).

My point is that if we can't give top level properties to everything,
hence the original introduction of the annotations dictionary in the
first place.  Only a handful of really important things got their own
properties (id, name, description and the sequence itself).  If there
was only ONE key quality score, then I wouldn't mind making an
exception so much - but that doesn't seem to be the case.

Peter