[Bioperl-l] Re: Frameshifts in alignments ... ?

Ewan Birney birney@ebi.ac.uk
Tue, 3 Sep 2002 14:37:04 +0100 (BST)


On Tue, 3 Sep 2002, Aaron J Mackey wrote:

[snip]

BTW - we should call them  Bio::Seq::EncodedSequence

> 
> I. A richer Bio::LocatableSeq (perhaps renamed ... Bio::EncodingSeq ??) to
> which SeqFeatures that represent introns and frameshifts can be added; and
> perhaps the following additional methods, to begin:
> 



Hmmm.... This is an interesting route.


Remember that the "encoding" is as well as the bases, ie, one effectively
has two "tracks", being

   CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
   ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET


This is in fact v. similar as the "label" idea inside Wise2 but it being
more efficiently stored. 

Notice that Gaps (as they don't exist in the sequence) have their own
"label" and their own gap character.


I am happy to get into this. I would propose the following encodings:


  C - indicates is part of a codon (do we need C = 1st base of codon, D =
2nd base of codon, E = 3rd base of codon?)

  I - intron (also could go for I = phase 0 intron, J = phase 1 intron, K
= phase 2 intron)

  F - frameshift indel (ie, sequencing error)

  G - gap, implicitly protein coding

  H - gap, in "other" dna sequence

  U - UTR (or U - 5'UTR, V - 3'UTR)



I could adapt genewise to directly output this stuff.



Are you keen to code this up Aaron... or hoping I would ?








> =head2 translate
> 
>  Title   : translate
>  Usage   : $prot = $obj->translate
>  Function: produces an intron and frameshift-aware translation
>            of the DNA sequence; same functionality as the translate() from
>            Bio::PrimarySeqI
>  Returns : A Bio::PrimarySeqI implementing object
>  Args    : character for terminator (optional) defaults to '*'
>            character for unknown amino acid (optional) defaults to 'X'
>            frame (optional) valid values 0, 1, 2, defaults to 0
>            codon table id (optional) defaults to 1
>            complete coding sequence expected, defaults to 0 (false)
>            boolean, throw exception if not complete CDS (true) or defaults
>            to warning (false)
> 
> =cut
> 
> sub translate { shift(@_)->cds(@_)->translate(@_); }
> 
> =head2 as_string
> 
>  Title   : as_string
>  Usage   : print $obj->as_string(-gaps => '-',
>                                  -frameshifts => '<>',
>                                  -introns => '[]')
>  Function: produces a string that includes symbols for frameshifts and
>            introns for printing, e.g. "ATAC[AGCG]AG<GTG>GT-GG"
>  Returns : said string
>  Args    : gaps - symbol to use for gaps
>            frameshifts - two symbols to use for deletional frameshift and
>                          insertional frameshift
>            introns - symbols to use for the beginning and ending of
>                      an intron
> 
> =cut
> 
> # code omitted for brevity and because it doesn't exist yet; this could
> # also be the function for an overloaded quote operator; why doesn't
> # Bio::PrimarySeq already have an overloaded quote, btw?
> 
> =head2 cds
> 
>  Title   : cds
>  Usage   : $cDNA = $dna->cds;
>  Function: Like translate(), but does not perform the translation, only
>            calculates the correct codons (given frame, gaps,
>            frameshifts and introns) and strings them together until the
>            first stop codon is seen (which is included in the output).
>  Returns : A Bio::PrimarySeqI implementing object
>  Args    : [ see translate;  ]
> 
> =cut
> 
> # ditto.
> 
> II. An alignment object willing to deal with the "duality" of these
> seq objects; that can maintain an (multiple) alignment of proteins, some
> of which may be only implictly represented by an encoding DNA sequence;
> perhaps Bio::EncodedAlign to match Bio::EncodingSeq ?  Would have the
> same functionality as Bio::SimpleAlign (and would, in fact, reuse its
> methods), but would just sit as an intermediate to handle the translations
> of the DNA sequence, and provide direct access to the codons.
> 
> As usual, I have a fairly protein-centric view of the world; what am I
> neglecting to realize?
> 
> -Aaron
> 
> -- 
>  Aaron J Mackey
>  Pearson Laboratory
>  University of Virginia
>  (434) 924-2821
>  amackey@virginia.edu
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------