[Bioperl-l] not all sequence is created equal (base quality data)

Wed, 27 Jun 2001 16:05:25 -0400 (EDT)

It would obviously be of interest to our friends doing sequencing, as
well as our friends doing prediction and other analysis who want to
weight low-quality sequence less, if we could incorporate base quality
information into the idea of Sequence somehow.

Could we architect a design to handle this and have quality values paired
with bases?  

I can imagine a few ways to do this:
 - an additional data field in the PrimarySeq object,
 - a parallel Seq::Quality object paired with a PrimarySeq object
 - a SeqFeature which spanned the entire sequence and had the
   primary tag 'quality' and a value of the sequence quality.
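
To make the second option a bit more concrete, here is a rough sketch of
a sequence paired with one quality value per base. It is written in
Python purely for illustration; the class name SeqWithQuality and its
methods are hypothetical, not an existing bioperl (or biojava) API, and
the integer quality scale is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqWithQuality:
    """Hypothetical pairing of a sequence with one quality value per base."""
    seq: str
    qual: List[int]

    def __post_init__(self):
        # The two tracks only make sense if they stay the same length.
        if len(self.seq) != len(self.qual):
            raise ValueError("need exactly one quality value per base")

    def subseq(self, start: int, end: int) -> "SeqWithQuality":
        """Slice both tracks together (1-based, inclusive coordinates)."""
        return SeqWithQuality(self.seq[start - 1:end], self.qual[start - 1:end])

    def mask_low_quality(self, threshold: int, mask_char: str = "N") -> str:
        """Return the sequence with bases below the threshold masked out."""
        return "".join(
            base if q >= threshold else mask_char
            for base, q in zip(self.seq, self.qual)
        )
```

The main point is that any operation that changes coordinates (subseq,
reverse complement, truncation) has to be applied to both tracks in
step, whichever of the three designs we pick.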

None of these seems particularly elegant, but the BaseWithQualityScores object
(biojava) is not going to work very well in perl either.

Anyone have ideas on this?  I think this would be worth considering as a
project for 1.0 if anyone else agrees.

This came up because I started playing with PIR data, and we can easily make
it work, except that some PIR files have quality information about their
residues embedded in the sequence (probably not the best way to do
this...)

>P1;CCDG
cytochrome c - dog (tentative sequence)
GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENP
KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*

Looking at their coding table (+), this is oh so much fun to try and code
for...  I can at least strip out this quality data for now to allow us to
read in PIR files, but it would be very interesting if we COULD integrate
quality data into the sequence object, so that we could also read in
sequence read quality values.
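
The stripping step could look roughly like the sketch below (Python, for
illustration only). The function name strip_pir_punctuation is made up,
and it assumes the only non-residue characters are the punctuation marks
from the coding table plus the terminating '*'.

```python
import re

# Punctuation characters listed in the PIR coding table.
PIR_PUNCTUATION = re.compile(r"[()=/.,]")


def strip_pir_punctuation(raw: str) -> str:
    """Return the bare residue string from a punctuated PIR sequence."""
    seq = "".join(raw.split())   # join lines, drop whitespace
    seq = seq.rstrip("*")        # drop the PIR end-of-sequence marker
    return PIR_PUNCTUATION.sub("", seq)
```

On the first line of the CCDG record above, the region
"(C.A.Q.C.H.T.V.E)" simply becomes "CAQCHTVE". Note that this throws away
the region and fragment boundaries as well as the confidence marks.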

(+) 
Table II: Punctuation Description in Protein Sequences

XX   Two adjacent amino acids, with no punctuation between, indicates
       that they are connected, as determined experimentally.
()   Encloses a region, the composition but not the complete sequence
       of which has been determined experimentally, or encloses a
       single residue that has been tentatively identified.
 =   Indicates ")(", the juxtaposition of two regions of indeterminate
       sequence, while preserving proper spacing between amino acids.
 /   Indicates that the adjacent amino acids are from different
       peptides, not necessarily connected. When the amino end of a
       protein has not been determined, "/" precedes the first residue.
       When the carboxyl end has not been determined, "/" follows the
       last residue. When ")/", "/(", or ")/(" are needed, only "/" is
       used.
 .  Outside of parentheses, indicates the ends of sequence fragments.
       The relative order of these fragments was not determined
       experimentally but is clear from homology or other indirect
       evidence.
 .  Within parentheses, indicates that the amino acid to the left
       has been placed with at least 90% confidence by homology with
       known sequences.
 ,  Indicates that the amino acid to its left could not be
       positioned with confidence by homology.
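
If we did want to keep this information rather than strip it, one
possible reading of the table is sketched below (Python, for
illustration only). The function annotate_pir and the four tag names are
my own invention, and the sketch ignores '/' and '.' outside of
parentheses, so treat it as a starting point rather than a faithful
parser.

```python
def annotate_pir(raw):
    """Return (residues, tags) with one tag per residue.

    Tags (hypothetical names):
      determined - outside parentheses, no mark
      homology   - inside parentheses, followed by '.'
      uncertain  - followed by ','
      region     - inside parentheses, no mark after it
    """
    chars = "".join(raw.split()).rstrip("*")
    residues, tags = [], []
    inside = False
    for i, ch in enumerate(chars):
        if ch == "(":
            inside = True
        elif ch == ")":
            inside = False
        elif ch.isalpha():
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt == ",":
                tag = "uncertain"
            elif inside and nxt == ".":
                tag = "homology"
            elif inside:
                tag = "region"
            else:
                tag = "determined"
            residues.append(ch)
            tags.append(tag)
        # '=' stands for ")(", so we stay inside a region; other
        # punctuation is skipped in this sketch.
    return "".join(residues), tags
```

The tag list is parallel to the residue string, so it could be stored in
the same way as numeric base quality values.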

Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center 
http://www.chg.duke.edu/