[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Sat Feb 21 00:24:43 UTC 2009

Hi all,

I am sort of living in this world right now, doing a lot of
metagenomics, so here are my $0.02. I agree with Leighton (assuming I
understand him): We should consider the possible applications people
will run using the quality data when designing the data structure.

1) From what I have seen, the most common use for quality scores is
trimming the sequences, i.e. removing the lower-quality sequence data
(usually at the edges) from the 5' and 3' ends of the read. So any data
structure should take into account that we will probably have
a .trim(self, threshold) method or a trim(seq, threshold) function that
returns a slice of the sequence.
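
A minimal sketch of the trimming idea, assuming Phred-style integer
scores held in a list parallel to the sequence. The function name, its
signature and the simple edge-scan rule are hypothetical illustrations,
not an existing Biopython API; real trimmers often use a sliding window
or a running sum instead.

```python
def trim(seq, quals, threshold):
    """Return (seq, quals) with low-quality bases clipped from both ends.

    Hypothetical helper: scans inward from the 5' and 3' ends and stops
    at the first base on each side whose score is >= threshold.
    Low-scoring bases in the interior of the read are kept.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")
    start = 0
    end = len(seq)
    # Walk in from the 5' end.
    while start < end and quals[start] < threshold:
        start += 1
    # Walk in from the 3' end.
    while end > start and quals[end - 1] < threshold:
        end -= 1
    # The same slice is applied to the letters and to the scores.
    return seq[start:end], quals[start:end]


read = "ACGTTGCAAC"
scores = [5, 8, 30, 32, 35, 12, 31, 30, 9, 4]
trimmed_seq, trimmed_quals = trim(read, scores, 20)
print(trimmed_seq, trimmed_quals)
```

Returning the scores together with the letters keeps the two in step,
which is the property any SeqRecord-level slice would need as well.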

2) There is a definite need for optimization. Quality scores usually
come with high-throughput data, which today can mean around 3 Gbp per
run. I am not sure exactly where this is going, but with the advent of
high-throughput, short-read-based genomics maybe we should think about
a slim SeqRecord to speed up the processing of short reads. Or simply
write some stuff wrapped around C.
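
One way a "slim" record might look in pure Python, as a sketch only.
The class name SlimRead and its fields are made up for illustration.
It assumes non-negative scores that fit in one byte (so Phred-style
values, not negative Solexa scores), and uses __slots__ plus the
standard array module to keep per-read overhead down.

```python
from array import array


class SlimRead:
    """Hypothetical lightweight read: id, sequence string, byte-sized scores."""

    # __slots__ avoids a per-instance __dict__, which saves memory when
    # millions of short reads are held at once.
    __slots__ = ("id", "seq", "quals")

    def __init__(self, read_id, seq, quals):
        if len(seq) != len(quals):
            raise ValueError("sequence and quality scores differ in length")
        self.id = read_id
        self.seq = seq
        # Typecode "B" stores each score as one unsigned byte (0-255).
        self.quals = array("B", quals)

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        # Only slicing is supported here; letters and scores are cut together.
        if not isinstance(index, slice):
            raise TypeError("SlimRead supports slicing only")
        return SlimRead(self.id, self.seq[index], self.quals[index])


read = SlimRead("read1", "ACGTAC", [10, 20, 30, 40, 30, 20])
sub = read[1:4]
print(sub.seq, list(sub.quals))
```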

./I

On Fri, 2009-02-20 at 18:19 -0500, Brad Chapman wrote:
> Hi all;
> Good points on this debate so far. What do you all think about a
> hybrid approach where the .quality attribute is a dictionary? The
> keys would be the quality type ("phred", "solexa"...) and the values
> would be a list or string the same length as the sequence.
> 
> For slicing, all of the quality dictionary values would be sliced
> identically to the sequence itself. For BioSQL storage the quality
> items would go in as annotations with names as a concatenation
> of the attribute and type ("quality_phred").
> 
> Treating these specially on the BioSQL in/out is a little hack-y,
> but quality is likely important enough to not bury it.
> 
> For Leighton's idea of generalization you could either:
> 
> - Derive a heavy-weight SeqRecord class from the base class that
>   adds several additional per-symbol cases.
> 
> - Provide a generic per_symbol_annotations attribute that collected
>   these as a dictionary of dictionaries:
> 
>   dict(quality = dict(phred = [20, 30]),
>        hydrophobicity = dict(some_predictor = ['some', 'scores'])
>       )
> 
> These could map to generic attributes in the same way and follow the
> same slicing rules. After writing this up, I think the second idea
> is better and probably exactly what Leighton was proposing.
> 
> Brad
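
A rough sketch of Brad's second option (the dict of dicts), assuming
every innermost value is a list or string with the same length as the
sequence. The helper names slice_annotations and flatten_names are
hypothetical. The flat names follow the "attribute_type" concatenation
he describes for BioSQL, but nothing here talks to BioSQL itself.

```python
def slice_annotations(per_symbol_annotations, index):
    """Apply one slice to every per-symbol value in a dict of dicts."""
    return {
        kind: {source: values[index] for source, values in by_source.items()}
        for kind, by_source in per_symbol_annotations.items()
    }


def flatten_names(per_symbol_annotations):
    """Map nested keys to flat 'attribute_type' names, e.g. 'quality_phred'."""
    return {
        f"{kind}_{source}": values
        for kind, by_source in per_symbol_annotations.items()
        for source, values in by_source.items()
    }


seq = "ACGTAC"
annotations = {
    "quality": {"phred": [20, 30, 25, 40, 10, 5]},
    # Placeholder labels, one per letter, standing in for some predictor output.
    "hydrophobicity": {"some_predictor": "HHLLHH"},
}

# The sequence and all annotations are cut with the same slice.
sub_seq = seq[1:4]
sub_annotations = slice_annotations(annotations, slice(1, 4))
print(sub_seq, sub_annotations)
print(sorted(flatten_names(annotations)))
```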
> 
> > Another 2p... I collect them, you know...
> > 
> > An additional determinant of how these values are best stored is: "What will
> > they be used for?".
> > 
> > If the only use they would ever find was to accompany a sequence so that its
> > file format could be converted from one with embedded qualities to a format
> > that required two such files (or vice-versa), then straightforward storage
> > as a string in a dictionary is all that's needed.  This would be sufficient
> > for conversion between some quality scores, as a utility function could just
> > grab the stored string (given an appropriate name for each quality format).
> > The question of how these per-symbol annotations would be modified when
> > returning a Seq slice or join may be an issue.
> > 
> > If 'live' access to the values is required for calculation or alignment
> > purposes, then a different interface might be more useful, permitting
> > slicing, base selection on the basis of quality, or other operation.  This
> > use case is more complex, as the return value is likely to be dependent on
> > the quality format (single- or multiple-value per base).
> > 
> > Conceptually, I see quality scores as annotations of a sequence, rather than
> > an intrinsic property of the sequence, so am happy for them to live in the
> > same place other annotations do.  I also see them as only one instance of a
> > class of per-symbol annotations (along with hydrophobicity scores, secondary
> > structure predictions, read map counts and several other measures).  I
> > think, therefore, that there is a case for a class describing per-symbol
> > annotations to a Seq, and placing these in a dictionary of per-symbol
> > annotations.  Slices of the parent Seq could then be propagated downwards to
> > all members of that dictionary (which would also be expected to implement
> > the same string-like methods as the parent).
> > 
> > The per-symbol annotation objects could be subclassed and/or contain a
> > descriptive string from a controlled vocabulary to indicate their format,
> > for standard interfacing with external packages (e.g. Drawing TOPS diagrams
> > from secondary structure predictions or rendering base quality profiles),
> > which I think would be a flexible approach.
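
Leighton's class-based version could look something like the sketch
below. Both class names (PerSymbolAnnotation, AnnotatedSeq) and the
format labels are hypothetical; the labels are free text standing in
for a controlled vocabulary. The point shown is only that a slice of
the parent is propagated to every member of the dictionary.

```python
class PerSymbolAnnotation:
    """Hypothetical container: per-symbol values plus a format label."""

    def __init__(self, values, fmt):
        self.values = values  # list or string, one item per symbol
        self.fmt = fmt        # descriptive label, e.g. "phred"

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        # A slice returns a new annotation with the same format label;
        # an integer index returns the single underlying value.
        if isinstance(index, slice):
            return PerSymbolAnnotation(self.values[index], self.fmt)
        return self.values[index]


class AnnotatedSeq:
    """Hypothetical sequence holder with a dict of per-symbol annotations."""

    def __init__(self, seq, per_symbol=None):
        self.seq = seq
        self.per_symbol = {}
        for name, annotation in (per_symbol or {}).items():
            if len(annotation) != len(seq):
                raise ValueError(f"{name!r} does not match the sequence length")
            self.per_symbol[name] = annotation

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("AnnotatedSeq supports slicing only")
        # Propagate the slice down to every per-symbol annotation.
        sliced = {name: ann[index] for name, ann in self.per_symbol.items()}
        return AnnotatedSeq(self.seq[index], sliced)


record = AnnotatedSeq(
    "MKVLAA",
    {
        "secondary_structure": PerSymbolAnnotation("HHHEEC", "three_state"),
        "read_depth": PerSymbolAnnotation([3, 5, 8, 8, 4, 2], "count"),
    },
)
sub = record[2:5]
print(sub.seq, sub.per_symbol["secondary_structure"].values)
```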
> > 
> > On 20/02/2009 11:49, "Jose Blanca" <jblanca at btc.upv.es> wrote:
> > 
> > >> I suppose you could consider adding a .phred_quality
> > >> property which is explicit, but then you'd end up with many different
> > >> properties.  Then there are other per-letter quality annotations - you
> > >> might want the A, C, G and T intensity from capillary sequencing (four
> > >> sets of numbers, not just one).  Plus of course this doesn't address
> > >> non-quality related per-letter-annotations (like secondary structure,
> > >> or atomic coordinates).
> > >> 
> > >> My point is that we can't give top-level properties to everything,
> > >> hence the original introduction of the annotations dictionary in the
> > >> first place.  Only a handful of really important things got their own
> > >> properties (id, name, description and the sequence itself).  If there
> > >> was only ONE key quality score, then I wouldn't mind making an
> > >> exception so much - but that doesn't seem to be the case.
> > > That's a very good point. It wouldn't be wise to populate the SeqRecord class
> > > with a lot of properties.
> > > Another possible approach would be to create a derived class for that, a
> > > SeqWithQuality. It would be like a SeqRecord but with a .quality property.
> > > For other cases other classes could be derived from SeqRecord.
> > > The problem with putting the qualities in a dict with all the other per-base
> > > annotations is that it has a different behaviour than the .seq case. The seq
> > > case is special because it is much more used, so maybe that's fair enough.
> > > I don't know, maybe it is wiser to put all the per-base annotations in a dict
> > > and leave the sequence outside. That way we won't be creating a lot of new
> > > classes derived from SeqRecord.
> > > The more I think about the dict possibility, the more I like it.
> > 
> > -- 
> > Dr Leighton Pritchard MRSC
> > D131, Plant Pathology Programme, SCRI
> > Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> > e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> > gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
> > 
> > 
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
-- 
Iddo Friedberg, Ph.D.
CALIT2 Atkinson Hall MC #0446
University of California San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446 USA
+1 (858) 534-0570
http://iddo-friedberg.org