[Bioperl-l] Re: Frameshifts in alignments ... ?

Tue, 3 Sep 2002 08:28:39 -0400 (EDT)

On Thu, 29 Aug 2002, Ewan Birney wrote:

> I guess from my perspective, if we do frameshifts, we should do introns as
> well (genewise stuff) and ... that leads into the more general view of
> alignments as a series of coordinates *and* states for each column.

[ ... ]

> The problem I hit is that what people want to do with thes alignments are
> wildly different - some people want to see the whole structure laid bare -
> others just want the implict protein alignment, in other cases people want
> the codon-by-codon alignment. What I have discovered is that there seems
> to be no ideal mixtures of ease of use *and* complete representation, and
> I have opted for the complete representation mode inside Wise2

[ ... ]

> What would you propose Aaron?

We can take a little step back, and recognize that we're already handling
gapped alignments fine, via Bio::LocatableSeq (OT aside: is there some
reason all these Seq subclasses weren't named Bio::Seq::Locatable or
somesuch?).  That's all we really nead for protein sequences.  For DNA
however, we end up with two more classes of "gaps": frameshifts and
introns.  Neither of those make any sense when discussing a protein
sequence, but there may be an implicit mapping between a protein sequence
and it's encoding DNA, which could include any of the above.  Further,
it'd be quite useful for me right now to have a DNA-containing Seq object,
that when I called translate() on it would be gap/frameshift/intron aware
(and you could imagine these marked-up DNA sequences as being part of an
alignment object, as well).

So that means we have to represent the frameshifts and introns somehow.
One easy way that you can do that immediately is via SeqFeatures with
simple locations on them that demarcate the gaps.  This may be a handy
means of representing the data (in terms of IO), and makes use of
infrastructure already in place.  It may not be the most efficient of data
structures for internal computation (as say, an N^K X 4 multidimensional
matrix of match/gap/frameshift/intron states for each of K sequences of
length N), but I imagine that such an internal represention could be
maintained if warranted.

So here's my naive proposal:

I. A richer Bio::LocatableSeq (perhaps renamed ... Bio::EncodingSeq ??) to
which SeqFeatures that represent introns and frameshifts can be added; and
perhaps the following additional methods, to begin:

=head2 translate

 Title   : translate
 Usage   : $prot = $obj->translate
 Function: produces an intron and frameshift-aware translation
           of the DNA sequence; same functionality as the translate() from
           Bio::PrimarySeqI
 Returns : A Bio::PrimarySeqI implementing object
 Args    : character for terminator (optional) defaults to '*'
           character for unknown amino acid (optional) defaults to 'X'
           frame (optional) valid values 0, 1, 2, defaults to 0
           codon table id (optional) defaults to 1
           complete coding sequence expected, defaults to 0 (false)
           boolean, throw exception if not complete CDS (true) or defaults
           to warning (false)

=cut

sub translate { shift(@_)->cds(@_)->translate(@_); }

=head2 as_string

 Title   : as_string
 Usage   : print $obj->as_string(-gaps => '-',
                                 -frameshifts => '<>',
                                 -introns => '[]')
 Function: produces a string that includes symbols for frameshifts and
           introns for printing, e.g. "ATAC[AGCG]AG<GTG>GT-GG"
 Returns : said string
 Args    : gaps - symbol to use for gaps
           frameshifts - two symbols to use for deletional frameshift and
                         insertional frameshift
           introns - symbols to use for the beginning and ending of
                     an intron

=cut

# code omitted for brevity and because it doesn't exist yet; this could
# also be the function for an overloaded quote operator; why doesn't
# Bio::PrimarySeq already have an overloaded quote, btw?

=head2 cds

 Title   : cds
 Usage   : $cDNA = $dna->cds;
 Function: Like translate(), but does not perform the translation, only
           calculates the correct codons (given frame, gaps,
           frameshifts and introns) and strings them together until the
           first stop codon is seen (which is included in the output).
 Returns : A Bio::PrimarySeqI implementing object
 Args    : [ see translate;  ]

=cut

# ditto.

II. An alignment object willing to deal with the "duality" of these
seq objects; that can maintain an (multiple) alignment of proteins, some
of which may be only implictly represented by an encoding DNA sequence;
perhaps Bio::EncodedAlign to match Bio::EncodingSeq ?  Would have the
same functionality as Bio::SimpleAlign (and would, in fact, reuse its
methods), but would just sit as an intermediate to handle the translations
of the DNA sequence, and provide direct access to the codons.

As usual, I have a fairly protein-centric view of the world; what am I
neglecting to realize?

-Aaron

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu