[Bioperl-l] Re: Frameshifts in alignments ... ?
Ewan Birney
birney@ebi.ac.uk
Thu, 5 Sep 2002 08:07:49 +0100 (BST)
On Wed, 4 Sep 2002, Hilmar Lapp wrote:
> Wouldn't we want the Transcript and Exon classes to utilize this too
> for the cds() method, or is something in this mechanism that makes
> it specific to alignments?
You have to handle gaps as well.
>
> -hilmar
>
> On Wednesday, September 4, 2002, at 08:00 AM, Aaron J Mackey wrote:
>
> >
> > Ewan wrote:
> >
> >> Remember that the "encoding" is as well as the bases, ie, one
> >> effectively
> >> has two "tracks", being
> >>
> >> CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
> >> ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
> >>
> >>
> >> I am happy to get into this. I would propose the following encodings:
> >>
> >> C - indicates is part of a codon (do we need C = 1st base of
> >> codon, D = 2nd base of codon, E = 3rd base of codon?)
> >
> > I like the "CDE", myself; users can supply an encoded sequence with
> > just
> > C's though.
> >
> >> I - intron (also could go for I = phase 0 intron, J = phase 1
> >> intron, K = phase 2 intron)
> >
> > Ditto.
> >
> >> F - frameshift indel (ie, sequencing error)
> >
> > Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
> > nucleotide, and translate the next full codon) or "backward" (i.e.
> > we're
> > missing a nucleotide, so we only use the adjacent two nucleotides
> > for the
> > codon). For example, using '\' and '/' for forward and backwards
> > frameshifts (whitespace added for clarity):
> >
> > ATG GCTA GCT A-C GAC TAC
> > CCC C\CC CCC C/C CCC CCC
> >
> > leads to a CDS of:
> >
> > ATG GTA GCT ANC GAC TAC
> >
> > So, in a sense, a backward frameshift is a bit like a gap, where a
> > forward
> > frameshift is a bit like a one-nucleotide intron. Regardless, I
> > need two
> > symbols, F and B maybe? Just nod if this works for you.
> >
> >> G - gap, implicitly protein coding
> >>
> >> H - gap, in "other" dna sequence
> >
> > These are protein indel parallels to F and B, right? i.e.:
> >
> > ATGGCT---GTATGT
> > CCCCCCGGGCCCHHH
> >
> > would this then include the last codon TGT in the CDS? If so, how does
> > HHH differ from CCC, in terms of influencing the encoded protein
> > sequence?
> > If it's only in terms of expressing the presence of a gap in another
> > sequence, then I think that's better handled in that other sequence
> > (together in a SimpleAlign or similar object). This is still a bit
> > fuzzy
> > for me ...
> >
> >> U - UTR (or U - 5'UTR, V - 3'UTR)
> >
> > Ditto as above for CDE and IJK; U and V will be used by us, but the
> > user
> > can just say U (and if we know strand, we can work out what's what).
> >
> > OK, then here is my proposed functionality; please read carefully
> > as this
> > is exactly what you'll get, unless someone raises a little noise soon:
> >
> >
> > package Bio::EncodedSeq;
> >
> > use strict;
> > use Bio::LocatableSeq;
> >
> > @ISA = qw(Bio::LocatableSeq);
> >
> > =head2 new
> > Title : new
> > Usage : $obj = Bio::EncodedSeq->new(-dnaseq => "AGTACGTGTCATG",
> > -encoding => "CCCCCCFCCCCCC",
> > -id => "myseq",
> > -start => 1,
> > -end => 13,
> > -strand => 1
> > );
> > Function: creates a new Bio::EncodedSeq object from a supplied DNA
> > sequence
> > Returns : a new Bio::EncodedSeq object
> > Args : dnaseq - primary nucleotide sequence used to encode the
> > protein
> > encoding - a string of characters (see Encoding Table)
> > describing backwards frameshifts implied by the
> > encoding but not present in the sequence will be
> > added (as '-'s) to the sequence. If not
> > supplied, it will be assumed that all positions
> > are coding (C). Encoding may include either
> > implicit phase encoding characters (i.e. "CCC")
> > and/or explicit encoding characters (i.e. "CDE").
> > Alternatively, encoding may be a hashref
> > datastructure, with encoding characters as keys
> > and Bio::LocationI objects (or arrayrefs of
> > Bio::LocationI objects) as values, e.g.:
> > { C => [ Bio::Location::Simple->new(1,9),
> > Bio::Location::Simple->new(11,13) ],
> > F => Bio::Location::Simple->new(10,10),
> > } # same as "CCCCCCCCCFCCC"
> > id, start, end, strand - as with Bio::LocatableSeq; note
> > that the coordinates are relative to the
> > encoding DNA sequence, not the implicit protein
> > sequence.
> > =cut
> >
> > =head2 encoding
> > Title : encoding
> > Usage : $obj->encoding("CCCCCC");
> > $obj->encoding( -encoding => { I => $location } );
> > $enc = $obj->encoding(-explicit => 1);
> > $enc = $obj->encoding("CCCCCC", -explicit => 1);
> > $enc = $obj->encoding(-location => $location,
> > -explicit => 1 );
> > Function: get/set the objects encoding, either globally or by
> > location(s).
> > Returns : the (possibly new) encoding string.
> > Args : encoding - see the encoding argument to the new() function.
> > explicit - whether or not to return explicit phase
> > information in the coding (i.e. "CCC" becomes
> > "CDE", "III" becomes "IJK", etc); defaults to 0.
> > location - optional; location to get/set the encoding.
> > Defaults to the entire sequence.
> > =cut
> >
> > =head2 cds
> > Title : cds
> > Usage : $cds = $obj->cds();
> > Function: obtain the "spliced" DNA sequence, by removing any
> > nucleotides that participate in an UTR, forward frameshift
> > or intron, and replacing any unknown nucleotide implied by
> > a backward frameshift or gap with N's.
> > Returns : a Bio::EncodedSeq object, with an encoding consisting only
> > of "CCCC..".
> > Args : none.
> > =cut
> >
> > =head2 translate
> > Title : translate
> > Usage : $prot = $obj->translate(@args);
> > Function: obtain the protein sequence encoded by the underlying DNA
> > sequence; same as $obj->cds()->translate(@args).
> > Returns : a Bio::PrimarySeq object.
> > Args : same as the translate() function of Bio::PrimarySeqI
> > =cut
> >
> > =head2 seq
> > Title : seq
> > Usage : $protseq = $obj->seq();
> > Function: obtain the raw protein sequence encoded by the underlying
> > DNA sequence; This is the same as calling
> > $obj->translate()->seq();
> > Returns : a string of single-letter amino acid codes
> > Args : same as the seq() function of Bio::PrimarySeq; note that
> > this
> > function may not be used to set the protein sequence; see
> > the dnaseq() function for that.
> > =cut
> >
> > =head2 dnaseq
> > Title : dnaseq
> > Usage : $dnaseq = $obj->dnaseq();
> > $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
> > $obj->dnaseq(-dnaseq => "ATG",
> > -encoding => "CCC",
> > -location => $loc );
> > Function: get/set the underlying DNA sequence; will overwrite any
> > current DNA and/or encoding information present.
> > Returns : a string of single-letter nucleotide codes, including any
> > gaps implied by the encoding.
> > Args : dnaseq - the DNA sequence to be used as a replacement
> > encoding - the encoding of the DNA sequence (see the new()
> > constructor); defaults to all 'C'.
> > location - optional, the location of the DNA sequence to
> > get/set; defaults to the entire sequence.
> > =cut
> >
> > [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> > methods; note that the coordinates of those methods will refer only to
> > the underlying DNA sequence, not the implicit encoded protein sequence
> > - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> > system to include Bio::Coordinate::EncodedPair so that conversions can
> > be made more easily ... any comments on that? ]
> >
> > thanks for reading,
> >
> > -Aaron
> >
> > --
> > Aaron J Mackey
> > Pearson Laboratory
> > University of Virginia
> > (434) 924-2821
> > amackey@virginia.edu
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------