[Bioperl-l] not all sequence is created equal (base quality data)
David Block
dblock@gene.pbi.nrc.ca
Wed, 27 Jun 2001 14:39:47 -0600 (CST)
This would be fun and quite useful here. I'll pass it along to some
coders who may join the fray...
On Wed, 27 Jun 2001, Jason Stajich wrote:
> It would obviously be of interest to our friends doing sequencing as
> well as our friends doing prediction and other analysis who want to
> weigh low quality sequence less if we could incorporate base quality
> information into the idea of Sequence somehow.
>
> Could we architect a design to handle this and have quality values paired
> with bases?
>
> I can imagine a couple of ways to do this
> - an additional data field in PrimarySeq object,
> - a parallel Seq::Quality object paired with a PrimarySeq object
> - a SeqFeature which spanned the entire sequence and had the
> primary tag 'quality' and a value of the sequence quality.
>
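As a rough illustration of the second option above (a quality object kept in parallel with the sequence), here is a minimal sketch in Python rather than Perl. The class name SeqWithQuality and its methods are hypothetical and are not part of any existing bioperl or biojava API; the only point is that the two arrays stay the same length and are sliced together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqWithQuality:
    """Hypothetical container: a sequence plus one quality value per residue."""
    seq: str
    qual: List[int]

    def __post_init__(self) -> None:
        # The pairing only makes sense if every residue has a quality value.
        if len(self.seq) != len(self.qual):
            raise ValueError("need exactly one quality value per residue")

    def subseq(self, start: int, end: int) -> "SeqWithQuality":
        """Return residues start..end (1-based, inclusive) with their qualities."""
        if not 1 <= start <= end <= len(self.seq):
            raise ValueError("coordinates out of range")
        return SeqWithQuality(self.seq[start - 1:end], self.qual[start - 1:end])

    def mask_low_quality(self, threshold: int, mask_char: str = "N") -> str:
        """Replace residues whose quality is below threshold with mask_char."""
        return "".join(
            base if q >= threshold else mask_char
            for base, q in zip(self.seq, self.qual)
        )
```

With made-up values, SeqWithQuality("ACGTAC", [40, 38, 12, 9, 35, 41]).mask_low_quality(20) gives "ACNNAC", which is the sort of down-weighting an analysis tool could apply to low quality stretches.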
> None of these seem particularly elegant, but the BaseWithQualityScores object
> (biojava) is not going to work very well in perl either.
>
> Anyone have ideas on this? I think this would be worth considering as a
> project for 1.0 if anyone else agrees.
>
> This came up because I started playing with PIR data and we can easily make
> it work except for the fact that some PIR files have quality information
> about their bases, embedded in the sequence (probably not the best way
> to do this...)
>
> >P1;CCDG
> cytochrome c - dog (tentative sequence)
> GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENP
> KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*
>
> Looking at their coding table (+) this is oh so much fun to try and code
> for... I can at least strip out this quality data for now to allow us to
> read in pir files, but it would be very interesting if we COULD integrate
> quality data into the sequence object, for instance if we wanted to be able
> to read in sequence read quality values.
>
>
> (+)
> Table II: Punctuation Description in Protein Sequences
>
> XX  Two adjacent amino acids, with no punctuation between, indicates
>     that they are connected, as determined experimentally.
> ()  Encloses a region, the composition but not the complete sequence
>     of which has been determined experimentally, or encloses a
>     single residue that has been tentatively identified.
> =   Indicates ")(", the juxtaposition of two regions of indeterminate
>     sequence, while preserving proper spacing between amino acids.
> /   Indicates that the adjacent amino acids are from different
>     peptides, not necessarily connected. When the amino end of a
>     protein has not been determined, "/" precedes the first residue.
>     When the carboxyl end has not been determined, "/" follows the
>     last residue. When ")/", "/(", or ")/(" are needed, only "/" is
>     used.
> .   Outside of parentheses, indicates the ends of sequence fragments.
>     The relative order of these fragments was not determined
>     experimentally but is clear from homology or other indirect
>     evidence.
> .   Within parentheses, indicates that the amino acid to the left
>     has been placed with at least 90% confidence by homology with
>     known sequences.
> ,   Indicates that the amino acid to its left could not be
>     positioned with confidence by homology.
>
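To make the stripping step concrete, here is a small Python sketch that separates the residues from the punctuation and keeps a per-residue label instead of throwing the information away. The function name and the label strings are invented for illustration. Treating an unpunctuated residue inside parentheses as "tentative" is one reading of Table II, not something the table states outright, and "=", "/", "*" and "." outside parentheses are simply dropped here.

```python
from typing import List, Tuple


def split_pir_punctuation(raw: str) -> Tuple[str, List[str]]:
    """Return (plain sequence, one confidence label per residue).

    Labels (hypothetical names):
      experimental - residue outside parentheses
      homology     - inside parentheses, followed by "."
      unplaced     - inside parentheses, followed by ","
      tentative    - inside parentheses, no "." or "," after it
    """
    residues: List[str] = []
    labels: List[str] = []
    in_parens = False
    for ch in raw:
        if ch.isalpha():
            residues.append(ch)
            labels.append("tentative" if in_parens else "experimental")
        elif ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False
        elif ch in ".," and in_parens and labels and labels[-1] == "tentative":
            # "." and "," qualify the residue immediately to their left.
            labels[-1] = "homology" if ch == "." else "unplaced"
        # Anything else (whitespace, "=", "/", "*", "." outside parentheses)
        # is dropped in this sketch.
    return "".join(residues), labels
```

For the fragment "VQK(C.A,E)KG*" this returns the sequence "VQKCAEKG" with C labelled homology, A unplaced, E tentative and everything else experimental. The labels line up one to one with the residues, so they could feed any of the storage options discussed earlier.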
>
>
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center
> http://www.chg.duke.edu/
>
--
David Block
dblock@gene.pbi.nrc.ca
http://bioinfo.pbi.nrc.ca/dblock/wiki
NRC Plant Biotechnology Institute
Saskatoon, SK, Canada