[Bioperl-l] not all sequence is created equal (base quality data)

Paul Gordon gordonp@niji.imb.nrc.ca
Wed, 27 Jun 2001 20:22:17 -0300 (ADT)


> > This came up because I started playing with pir data and we can easily make
> > it work except for the fact that some PIR files have quality information
> > about their bases, embedded in the sequence (probably not the best way
> > to do this...)
> > 
> > >P1;CCDG
> > cytochrome c - dog (tentative sequence)
> > GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENP
> > KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*
> > 
> > Looking at their coding table (+) this is oh so much fun to try and code
> > for...  I can at least strip out this quality data for now to allow us to
> > read in pir files, but it would be very interesting if we COULD integrate
> > quality data into the sequence object, for instance if we wanted to be
> > able to read in the sequence read quality values.
> > 
> > 
> > (+) 
> > Table II: Punctuation Description in Protein Sequences
<snip />
As near as I can tell, PIR uses the IUPAC notation.  It would probably
not be a bad idea to be able to parse IUPAC peptide sequences in general
(though I'm not volunteering right now :-)). The specification can be
found at:

http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html#AA215

________________________________________________________________________
Paul Gordon                                     Paul.Gordon@nrc.ca
Genomic Technologies				http://maggie.cbr.nrc.ca
Institute for Marine Biosciences
National Research Council Canada