[Bioperl-l] Re: Frameshifts in alignments ... ?

Ewan Birney birney@ebi.ac.uk
Thu, 5 Sep 2002 08:07:49 +0100 (BST)


On Wed, 4 Sep 2002, Hilmar Lapp wrote:

> Wouldn't we want the Transcript and Exon classes to utilize this too 
> for the cds() method, or is something in this mechanism that makes 
> it specific to alignments?


You have to handle gaps as well. 





> 
> 	-hilmar
> 
> On Wednesday, September 4, 2002, at 08:00  AM, Aaron J Mackey wrote:
> 
> >
> > Ewan wrote:
> >
> >> Remember that the "encoding" is as well as the bases, ie, one 
> >> effectively
> >> has two "tracks", being
> >>
> >>   CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
> >>   ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
> >>
> >>
> >> I am happy to get into this. I would propose the following encodings:
> >>
> >>  C - indicates is part of a codon (do we need C = 1st base of 
> >> codon, D = 2nd base of codon, E = 3rd base of codon?)
> >
> > I like the "CDE", myself; users can supply an encoded sequence with 
> > just
> > C's though.
> >
> >>  I - intron (also could go for I = phase 0 intron, J = phase 1 
> >> intron, K = phase 2 intron)
> >
> > Ditto.
> >
> >>  F - frameshift indel (ie, sequencing error)
> >
> > Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
> > nucleotide, and translate the next full codon) or "backward" (i.e. 
> > we're
> > missing a nucleotide, so we only use the adjacent two nucleotides 
> > for the
> > codon).  For example, using '\' and '/' for forward and backwards
> > frameshifts (whitespace added for clarity):
> >
> > ATG GCTA GCT A-C GAC TAC
> > CCC C\CC CCC C/C CCC CCC
> >
> > leads to a CDS of:
> >
> > ATG GTA  GCT ANC GAC TAC
> >
> > So, in a sense, a backward frameshift is a bit like a gap, where a 
> > forward
> > frameshift is a bit like a one-nucleotide intron.  Regardless, I 
> > need two
> > symbols, F and B maybe?  Just nod if this works for you.
> >
> >>  G - gap, implicitly protein coding
> >>
> >>  H - gap, in "other" dna sequence
> >
> > These are protein indel parallels to F and B, right?  i.e.:
> >
> > ATGGCT---GTATGT
> > CCCCCCGGGCCCHHH
> >
> > would this then include the last codon TGT in the CDS?  If so, how does
> > HHH differ from CCC, in terms of influencing the encoded protein 
> > sequence?
> > If it's only in terms of expressing the presence of a gap in another
> > sequence, then I think that's better handled in that other sequence
> > (together in a SimpleAlign or similar object).  This is still a bit 
> > fuzzy
> > for me ...
> >
> >>  U - UTR (or U - 5'UTR, V - 3'UTR)
> >
> > Ditto as above for CDE and IJK; U and V will be used by us, but the 
> > user
> > can just say U (and if we know strand, we can work out what's what).
> >
> > OK, then here is my proposed functionality; please read carefully 
> > as this
> > is exactly what you'll get, unless someone raises a little noise soon:
> >
> >
> > package Bio::EncodedSeq;
> >
> > use strict;
> > use Bio::LocatableSeq;
> >
> > @ISA = qw(Bio::LocatableSeq);
> >
> > =head2 new
> >  Title   : new
> >  Usage   : $obj = Bio::EncodedSeq->new(-dnaseq   => "AGTACGTGTCATG",
> >                                        -encoding => "CCCCCCFCCCCCC",
> >                                        -id       => "myseq",
> >                                        -start    => 1,
> >                                        -end      => 13,
> >                                        -strand   => 1
> >                                       );
> >  Function: creates a new Bio::EncodedSeq object from a supplied DNA
> >            sequence
> >  Returns : a new Bio::EncodedSeq object
> >  Args    : dnaseq   - primary nucleotide sequence used to encode the
> >                       protein
> >            encoding - a string of characters (see Encoding Table)
> >                       describing backwards frameshifts implied by the
> >                       encoding but not present in the sequence will be
> >                       added (as '-'s) to the sequence.  If not
> >                       supplied, it will be assumed that all positions
> >                       are coding (C).  Encoding may include either
> >                       implicit phase encoding characters (i.e. "CCC")
> >                       and/or explicit encoding characters (i.e. "CDE").
> >                       Alternatively, encoding may be a hashref
> >                       datastructure, with encoding characters as keys
> >                       and Bio::LocationI objects (or arrayrefs of
> >                       Bio::LocationI objects) as values, e.g.:
> >                       { C => [ Bio::Location::Simple->new(1,9),
> >                                Bio::Location::Simple->new(11,13) ],
> >                         F => Bio::Location::Simple->new(10,10),
> >                       } # same as "CCCCCCCCCFCCC"
> >            id, start, end, strand - as with Bio::LocatableSeq; note
> >                       that the coordinates are relative to the
> >                       encoding DNA sequence, not the implicit protein
> >                       sequence.
> > =cut
> >
> > =head2 encoding
> >  Title   : encoding
> >  Usage   : $obj->encoding("CCCCCC");
> >            $obj->encoding( -encoding => { I => $location } );
> >            $enc = $obj->encoding(-explicit => 1);
> >            $enc = $obj->encoding("CCCCCC", -explicit => 1);
> >            $enc = $obj->encoding(-location => $location,
> >                                  -explicit => 1 );
> >  Function: get/set the objects encoding, either globally or by 
> > location(s).
> >  Returns : the (possibly new) encoding string.
> >  Args    : encoding - see the encoding argument to the new() function.
> >            explicit - whether or not to return explicit phase
> >                       information in the coding (i.e. "CCC" becomes
> >                       "CDE", "III" becomes "IJK", etc); defaults to 0.
> >            location - optional; location to get/set the encoding.
> >                       Defaults to the entire sequence.
> > =cut
> >
> > =head2 cds
> >  Title   : cds
> >  Usage   : $cds = $obj->cds();
> >  Function: obtain the "spliced" DNA sequence, by removing any
> >            nucleotides that participate in an UTR, forward frameshift
> >            or intron, and replacing any unknown nucleotide implied by
> >            a backward frameshift or gap with N's.
> >  Returns : a Bio::EncodedSeq object, with an encoding consisting only
> >            of "CCCC..".
> >  Args    : none.
> > =cut
> >
> > =head2 translate
> >  Title   : translate
> >  Usage   : $prot = $obj->translate(@args);
> >  Function: obtain the protein sequence encoded by the underlying DNA
> >            sequence; same as $obj->cds()->translate(@args).
> >  Returns : a Bio::PrimarySeq object.
> >  Args    : same as the translate() function of Bio::PrimarySeqI
> > =cut
> >
> > =head2 seq
> >  Title   : seq
> >  Usage   : $protseq = $obj->seq();
> >  Function: obtain the raw protein sequence encoded by the underlying
> >            DNA sequence; This is the same as calling
> >            $obj->translate()->seq();
> >  Returns : a string of single-letter amino acid codes
> >  Args :    same as the seq() function of Bio::PrimarySeq; note that 
> > this
> >            function may not be used to set the protein sequence; see
> >            the dnaseq() function for that.
> > =cut
> >
> > =head2 dnaseq
> >  Title   : dnaseq
> >  Usage   : $dnaseq = $obj->dnaseq();
> >            $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
> >            $obj->dnaseq(-dnaseq => "ATG",
> >                         -encoding => "CCC",
> >                         -location => $loc );
> >  Function: get/set the underlying DNA sequence; will overwrite any
> >            current DNA and/or encoding information present.
> >  Returns : a string of single-letter nucleotide codes, including any
> >            gaps implied by the encoding.
> >  Args    : dnaseq   - the DNA sequence to be used as a replacement
> >            encoding - the encoding of the DNA sequence (see the new()
> >                       constructor); defaults to all 'C'.
> >            location - optional, the location of the DNA sequence to
> >                       get/set; defaults to the entire sequence.
> > =cut
> >
> > [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> > methods; note that the coordinates of those methods will refer only to
> > the underlying DNA sequence, not the implicit encoded protein sequence
> > - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> > system to include Bio::Coordinate::EncodedPair so that conversions can
> > be made more easily ... any comments on that? ]
> >
> > thanks for reading,
> >
> > -Aaron
> >
> > --
> >  Aaron J Mackey
> >  Pearson Laboratory
> >  University of Virginia
> >  (434) 924-2821
> >  amackey@virginia.edu
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
> 
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------