[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Brad Chapman chapmanb at 50mail.com
Fri Feb 20 23:19:04 UTC 2009


Hi all;
Good points on this debate so far. What do you all think about a
hybrid approach where the .quality attribute is a dictionary? The
keys would be the quality type ("phred", "solexa"...) and the values
would be a list or string the same length as the sequence.

For slicing, all of the quality dictionary values would be sliced
identically to the sequence itself. For BioSQL storage the quality
items would go in as annotations with names as a concatenation
of the attribute and type ("quality_phred").

Treating these specially on the BioSQL in/out is a little hack-y,
but quality is likely important enough to not bury it.
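
A minimal sketch of the hybrid idea, assuming a stand-in class rather than
the real SeqRecord. The class name QualityRecord and the method
annotation_items are hypothetical, made up for illustration only; the
"quality_" + type naming is the concatenation suggested above.

```python
class QualityRecord:
    """Hypothetical stand-in for a SeqRecord with a .quality dictionary."""

    def __init__(self, seq, quality=None):
        self.seq = seq
        self.quality = dict(quality or {})
        # Every quality value must have one entry per letter of the sequence.
        for kind, values in self.quality.items():
            if len(values) != len(seq):
                raise ValueError(
                    f"{kind!r} has {len(values)} values for {len(seq)} letters"
                )

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        # Apply the same slice to the sequence and to every quality value.
        return QualityRecord(
            self.seq[index],
            {kind: values[index] for kind, values in self.quality.items()},
        )

    def annotation_items(self):
        """Return {"quality_<type>": values}, the proposed storage names."""
        return {f"quality_{kind}": values for kind, values in self.quality.items()}


record = QualityRecord("ACGTAC", {"phred": [20, 30, 40, 35, 10, 5]})
sub = record[1:4]  # the sequence and its phred scores are trimmed together
```

Here sub.annotation_items() yields a single "quality_phred" entry, which is
the name a BioSQL loader could store the values under.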

For Leighton's idea of generalization you could either:

- Derive a heavy-weight SeqRecord class from the base class that
  adds several additional per-symbol cases.

- Provide a generic per_symbol_annotations attribute that collected
  these as a dictionary of dictionaries:

  dict(quality = dict(phred = [20, 30]),
       hydrophobicity = dict(some_predictor = ['some', 'scores'])
      )

These could map to generic attributes in the same way and follow the
same slicing rules. After writing this up, I think the second idea
is better and probably exactly what Leighton was proposing.
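
A rough sketch of how the generic dictionary of dictionaries might be
sliced and mapped to flat names, assuming plain dicts and lists. The
helper names slice_per_symbol and flatten are hypothetical, and the
numeric hydrophobicity scores are made-up placeholder values.

```python
def slice_per_symbol(per_symbol_annotations, index):
    """Apply one slice to every value in a {category: {source: values}} dict."""
    return {
        category: {source: values[index] for source, values in by_source.items()}
        for category, by_source in per_symbol_annotations.items()
    }


def flatten(per_symbol_annotations):
    """Build "<category>_<source>" names, e.g. "quality_phred"."""
    return {
        f"{category}_{source}": values
        for category, by_source in per_symbol_annotations.items()
        for source, values in by_source.items()
    }


annotations = dict(
    quality=dict(phred=[20, 30, 40, 10]),
    hydrophobicity=dict(some_predictor=[0.1, -0.4, 1.2, 0.0]),
)
trimmed = slice_per_symbol(annotations, slice(0, 2))  # first two symbols only
flat = flatten(trimmed)
```

One caveat with the flat names: a source such as "some_predictor" already
contains an underscore, so splitting "hydrophobicity_some_predictor" back
into category and source would need a less ambiguous separator.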

Brad

> Another 2p... I collect them, you know...
> 
> An additional determinant of how these values are best stored is: "What will
> they be used for?".
> 
> If the only use they would ever find was to accompany a sequence so that its
> file format could be converted from one with embedded qualities to a format
> that required two such files (or vice-versa), then straightforward storage
> as a string in a dictionary is all that's needed.  This would be sufficient
> for conversion between some quality scores, as a utility function could just
> grab the stored string (given an appropriate name for each quality format).
> The question of how these per-symbol annotations would be modified when
> returning a Seq slice or join may be an issue.
> 
> If 'live' access to the values is required for calculation or alignment
> purposes, then a different interface might be more useful, permitting
> slicing, base selection on the basis of quality, or other operations.  This
> use case is more complex, as the return value is likely to be dependent on
> the quality format (single- or multiple-value per base).
> 
> Conceptually, I see quality scores as annotations of a sequence, rather than
> an intrinsic property of the sequence, so am happy for them to live in the
> same place other annotations do.  I also see them as only one instance of a
> class of per-symbol annotations (along with hydrophobicity scores, secondary
> structure predictions, read map counts and several other measures).  I
> think, therefore, that there is a case for a class describing per-symbol
> annotations to a Seq, and placing these in a dictionary of per-symbol
> annotations.  Slices of the parent Seq could then be propagated downwards to
> all members of that dictionary (which would also be expected to implement
> the same string-like methods as the parent).
> 
> The per-symbol annotation objects could be subclassed and/or contain a
> descriptive string from a controlled vocabulary to indicate their format,
> for standard interfacing with external packages (e.g. Drawing TOPS diagrams
> from secondary structure predictions or rendering base quality profiles),
> which I think would be a flexible approach.
> 
> On 20/02/2009 11:49, "Jose Blanca" <jblanca at btc.upv.es> wrote:
> 
> >> I suppose you could consider adding a .phred_quality
> >> property which is explicit, but then you'd end up with many different
> >> properties.  Then there are other per-letter quality annotations - you
> >> might want the A, C, G and T intensity from capillary sequencing (four
> >> sets of numbers, not just one).  Plus of course this doesn't address
> >> non-quality related per-letter-annotations (like secondary structure,
> >> or atomic coordinates).
> >> 
> >> My point is that we can't give top level properties to everything,
> >> hence the original introduction of the annotations dictionary in the
> >> first place.  Only a handful of really important things got their own
> >> properties (id, name, description and the sequence itself).  If there
> >> was only ONE key quality score, then I wouldn't mind making an
> >> exception so much - but that doesn't seem to be the case.
> > That's a very good point. It wouldn't be wise to populate the SeqRecord class
> > with a lot of properties.
> > Another possible approach would be to create a derived class for that, a
> > SeqWithQuality. It would be like a SeqRecord but with a .quality property.
> > For other cases other classes could be derived from SeqRecord.
> > The problem with putting the qualities in a dict with all the other per-base
> > annotations is that they would behave differently from .seq. The seq
> > case is special because it is much more widely used, so maybe that's fair
> > enough.
> > I don't know, maybe it is wiser to put all the per-base annotations in a dict
> > and leave the sequence outside. That way we won't be creating a lot of new
> > classes derived from SeqRecord.
> > The more I think about the dict possibility, the more I like it.
> 
> -- 
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
> 
> 


