[Bioperl-l] not all sequence is created equal (base quality data)
Malcolm Cook
mcook@dna.com
Wed, 27 Jun 2001 14:34:00 -0700
Jason,
Perhaps another way to go with this:
- an additional abstract method on Seq which takes a location and returns a
'quality'
By the way, what would a 'quality' be: a new object? A single number? A
z-score, a p-value? I've seen both real [0,1] and int [1,10].
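
To make the idea concrete, here is a minimal sketch, in Python rather than Perl purely for brevity. Nothing in it is existing Bioperl API: the class names QualitySeq and ArrayQualitySeq and the method name quality are hypothetical, the coordinates are assumed to be 1-based and inclusive, and the scores are assumed to be reals in [0,1], which is only one of the candidates listed above.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class QualitySeq(ABC):
    """Hypothetical interface: a sequence that can report quality for a location."""

    @abstractmethod
    def quality(self, start: int, end: int) -> List[float]:
        """Return one quality value per residue in the 1-based, inclusive range."""


class ArrayQualitySeq(QualitySeq):
    """One possible backing store: a score list kept parallel to the residues."""

    def __init__(self, residues: str, scores: Sequence[float]):
        if len(residues) != len(scores):
            raise ValueError("need exactly one score per residue")
        self.residues = residues
        self._scores = list(scores)

    def quality(self, start: int, end: int) -> List[float]:
        if not 1 <= start <= end <= len(self.residues):
            raise IndexError("location out of range")
        # Convert the 1-based inclusive location to a Python slice.
        return self._scores[start - 1:end]


# Assumed scale: real numbers in [0, 1], higher meaning more reliable.
seq = ArrayQualitySeq("GDVEK", [0.9, 0.9, 0.5, 0.5, 1.0])
region = seq.quality(3, 5)
```

Returning a list per location leaves the question of what a single 'quality' is open: callers can average it, take the minimum, or pass it on unchanged.
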
Regards
>-----Original Message-----
>From: Jason Stajich [mailto:jason@chg.mc.duke.edu]
>Sent: Wednesday, June 27, 2001 1:05 PM
>To: Bioperl
>Subject: [Bioperl-l] not all sequence is created equal (base quality data)
>
>
>It would obviously be of interest to our friends doing sequencing as
>well as our friends doing prediction and other analysis who want to
>weigh low quality sequence less if we could incorporate base quality
>information into the idea of Sequence somehow.
>
>Could we architect a design to handle this and have quality values paired
>with bases?
>
>I can imagine a couple of ways to do this:
> - an additional data field in PrimarySeq object,
> - a parallel Seq::Quality object paired with a PrimarySeq object
> - a SeqFeature which spanned the entire sequence and had the
>   primary tag 'quality' and a value of the sequence quality.
>
>None of these seem particularly elegant, but the BaseWithQualityScores
>object (biojava) is not going to work very well in Perl either.
>
>Does anyone have ideas on this? I think this would be worth considering
>as a project for 1.0 if anyone else agrees.
>
>This came up because I started playing with PIR data and we can easily
>make it work, except for the fact that some PIR files have quality
>information about their bases embedded in the sequence (probably not the
>best way to do this...)
>
>>P1;CCDG
>cytochrome c - dog (tentative sequence)
>GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKN
>KGITWGEETLMEYLENP
>KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*
>
>Looking at their coding table (+) this is oh so much fun to try and code
>for... I can at least strip out this quality data for now to allow us to
>read in PIR files, but it would be very interesting if we COULD integrate
>quality data into the sequence object, for example if we wanted to be
>able to read in sequence read quality values.
>
>
>(+)
>Table II: Punctuation Description in Protein Sequences
>
>XX  Two adjacent amino acids, with no punctuation between, indicates
>    that they are connected, as determined experimentally.
>()  Encloses a region, the composition but not the complete sequence
>    of which has been determined experimentally, or encloses a
>    single residue that has been tentatively identified.
> =  Indicates ")(", the juxtaposition of two regions of indeterminate
>    sequence, while preserving proper spacing between amino acids.
> /  Indicates that the adjacent amino acids are from different
>    peptides, not necessarily connected. When the amino end of a
>    protein has not been determined, "/" precedes the first residue.
>    When the carboxyl end has not been determined, "/" follows the
>    last residue. When ")/", "/(", or ")/(" are needed, only "/" is
>    used.
> .  Outside of parentheses, indicates the ends of sequence fragments.
>    The relative order of these fragments was not determined
>    experimentally but is clear from homology or other indirect
>    evidence.
> .  Within parentheses, indicates that the amino acid to the left
>    has been placed with at least 90% confidence by homology with
>    known sequences.
> ,  Indicates that the amino acid to its left could not be
>    positioned with confidence by homology.
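
The table above could be turned into a rough parser along these lines. This is only a sketch under stated assumptions, written in Python for brevity and not taken from any existing Bioperl PIR reader: the function name parse_pir and the label strings "experimental", "tentative", "homology" and "unplaced" are invented for illustration. It handles parentheses and the "." and "," marks inside them, and simply drops all other punctuation ("=", "/", "." outside parentheses, and the trailing "*").

```python
def parse_pir(raw):
    """Split a punctuated PIR sequence into plain residues and per-residue labels.

    Labels (hypothetical names):
      experimental  residue outside parentheses
      tentative     residue inside parentheses with no trailing mark
      homology      residue inside parentheses followed by "."
      unplaced      residue inside parentheses followed by ","
    """
    residues = []
    labels = []
    in_parens = False
    for ch in raw:
        if ch.isalpha():
            residues.append(ch)
            labels.append("tentative" if in_parens else "experimental")
        elif ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False
        elif in_parens and labels and labels[-1] == "tentative":
            # A mark inside parentheses qualifies the residue to its left.
            if ch == ".":
                labels[-1] = "homology"
            elif ch == ",":
                labels[-1] = "unplaced"
        # Anything else ("=", "/", "*", "." outside parentheses) is dropped;
        # "=" stands for ")(" so the in-parentheses state is left unchanged.
    return "".join(residues), labels


plain, labels = parse_pir("VQK(C.A,E)KG*")
```

The plain string is what a reader that strips the quality data would keep, while the parallel label list is the kind of per-residue information a quality-aware sequence object could hold.
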
>
>
>
>Jason Stajich
>jason@chg.mc.duke.edu
>Center for Human Genetics
>Duke University Medical Center
>http://www.chg.duke.edu/
>
>