[Bioperl-l] not all sequence is created equal (base quality data)

David Block dblock@gene.pbi.nrc.ca
Wed, 27 Jun 2001 14:39:47 -0600 (CST)


This would be fun and quite useful here.  I'll pass it along to some
coders who may join the fray...

On Wed, 27 Jun 2001, Jason Stajich wrote:

> It would obviously be of interest to our friends doing sequencing as
> well as our friends doing prediction and other analysis who want to
> weigh low-quality sequence less if we could incorporate base quality
> information into the idea of Sequence somehow.
> 
> Could we architect a design to handle this and have quality values paired
> with bases?  
> 
> I can imagine a couple of ways to do this
>  - an additional data field in PrimarySeq object, 
>  - a parallel Seq::Quality object paired with a PrimarySeq object
>  - a SeqFeature which spanned the entire sequence and had the
>    primary tag 'quality' and a value of the sequence quality.
> 
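
To make the second option above (a parallel quality object paired with the
sequence) concrete, here is a minimal sketch in Python rather than Perl. The
class name SeqWithQuality and its methods are hypothetical and are not part
of BioPerl; the sketch only assumes one integer quality value per base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqWithQuality:
    """Hypothetical pairing of a sequence with one quality value per base."""
    seq: str
    qual: List[int]

    def __post_init__(self) -> None:
        # The two arrays are only meaningful if they stay the same length.
        if len(self.seq) != len(self.qual):
            raise ValueError("sequence and quality must have the same length")

    def subseq(self, start: int, end: int) -> "SeqWithQuality":
        """Slice bases and qualities together (1-based, inclusive coordinates)."""
        if not 1 <= start <= end <= len(self.seq):
            raise ValueError("coordinates out of range")
        return SeqWithQuality(self.seq[start - 1:end], self.qual[start - 1:end])

    def mask_low_quality(self, threshold: int, mask_char: str = "N") -> str:
        """Return the sequence with bases below the threshold replaced."""
        return "".join(
            base if q >= threshold else mask_char
            for base, q in zip(self.seq, self.qual)
        )


if __name__ == "__main__":
    read = SeqWithQuality("ACGT", [30, 10, 40, 5])
    print(read.mask_low_quality(20))
    print(read.subseq(2, 3))
```

Keeping the quality values in an object of their own means code that only
needs the bases never has to look at them, but every slicing operation then
has to be applied to both arrays together, as subseq does here.
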
> None of these seem particularly elegant, but the BaseWithQualityScores object
> (biojava) is not going to work very well in perl either.
> 
> Anyone have ideas on this?   I think this would be worth considering as
> a project for 1.0, if anyone else agrees.
> 
> This came up because I started playing with PIR data and we can easily make
> it work except for the fact that some PIR files have quality information
> about their bases, embedded in the sequence (probably not the best way
> to do this...)
> 
> >P1;CCDG
> cytochrome c - dog (tentative sequence)
> GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENP
> KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*
> 
> Looking at their coding table (+) this is oh so much fun to try and code
> for...  I can at least strip out this quality data for now to allow us to
> read in PIR files, but it would be very interesting if we COULD integrate
> quality data into the sequence object, for example if we wanted to be able
> to read in sequence read quality values.
> 
> 
> (+) 
> Table II: Punctuation Description in Protein Sequences
> 
> XX   Two adjacent amino acids, with no punctuation between, indicates
>        that they are connected, as determined experimentally.
> ()   Encloses a region, the composition but not the complete sequence
>        of which has been determined experimentally, or encloses a
>        single residue that has been tentatively identified.
>  =   Indicates ")(", the juxtaposition of two regions of indeterminate
>        sequence, while preserving proper spacing between amino acids.
>  /   Indicates that the adjacent amino acids are from different
>        peptides, not necessarily connected. When the amino end of a
>        protein has not been determined, "/" precedes the first residue.
>        When the carboxyl end has not been determined, "/" follows the
>        last residue. When ")/", "/(", or ")/(" are needed, only "/" is
>        used.
>  .  Outside of parentheses, indicates the ends of sequence fragments.
>        The relative order of these fragments was not determined
>        experimentally but is clear from homology or other indirect
>        evidence.
>  .  Within parentheses, indicates that the amino acid to the left
>        has been placed with at least 90% confidence by homology with
>        known sequences.
>  ,  Indicates that the amino acid to its left could not be
>        positioned with confidence by homology.
> 
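
As a rough illustration of handling the Table II punctuation quoted above,
here is a Python sketch that strips the punctuation and records a confidence
label for each residue. The function name parse_pir_residues and the four
label strings are made up for this example. It assumes that "." and ","
qualify the residue to their left only inside parentheses, and it simply
drops "/", "=", the "*" terminator and "." outside parentheses, so fragment
boundaries are lost.

```python
from typing import List, Tuple


def parse_pir_residues(raw: str) -> Tuple[str, List[str]]:
    """Strip PIR punctuation and return (sequence, per-residue labels).

    Labels (hypothetical names):
      determined    residue outside parentheses
      tentative     residue inside parentheses with no qualifier after it
      homology      residue inside parentheses followed by "."
      unpositioned  residue inside parentheses followed by ","
    """
    residues: List[str] = []
    labels: List[str] = []
    in_parens = False
    for ch in raw:
        if ch.isalpha():
            residues.append(ch.upper())
            labels.append("tentative" if in_parens else "determined")
        elif ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False
        elif ch == "." and in_parens and labels:
            # Residue to the left was placed by homology.
            labels[-1] = "homology"
        elif ch == "," and in_parens and labels:
            # Residue to the left could not be positioned with confidence.
            labels[-1] = "unpositioned"
        # Anything else ("/", "=", "*", whitespace, "." outside
        # parentheses) is dropped in this sketch.
    return "".join(residues), labels


if __name__ == "__main__":
    seq, labels = parse_pir_residues("GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHK*")
    for residue, label in zip(seq, labels):
        print(residue, label)
```

The labels come back as a list the same length as the stripped sequence, so
they could be stored in any of the three designs listed earlier in the
message.
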
> 
> 
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center 
> http://www.chg.duke.edu/ 
> 

-- 
David Block
dblock@gene.pbi.nrc.ca
http://bioinfo.pbi.nrc.ca/dblock/wiki
NRC Plant Biotechnology Institute
Saskatoon, SK, Canada