[Bioperl-l] Re: Frameshifts in alignments ... ?
Aaron J Mackey
Aaron J. Mackey" <amackey@virginia.edu
Wed, 4 Sep 2002 11:00:13 -0400 (EDT)
Ewan wrote:
> Remember that the "encoding" is as well as the bases, ie, one effectively
> has two "tracks", being
>
> CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
> ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
>
>
> I am happy to get into this. I would propose the following encodings:
>
> C - indicates is part of a codon (do we need C = 1st base of codon, D = 2nd base of codon, E = 3rd base of codon?)
I like the "CDE", myself; users can supply an encoded sequence with just
C's though.
> I - intron (also could go for I = phase 0 intron, J = phase 1 intron, K = phase 2 intron)
Ditto.
> F - frameshift indel (ie, sequencing error)
Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
nucleotide, and translate the next full codon) or "backward" (i.e. we're
missing a nucleotide, so we only use the adjacent two nucleotides for the
codon). For example, using '\' and '/' for forward and backwards
frameshifts (whitespace added for clarity):
ATG GCTA GCT A-C GAC TAC
CCC C\CC CCC C/C CCC CCC
leads to a CDS of:
ATG GTA GCT ANC GAC TAC
So, in a sense, a backward frameshift is a bit like a gap, where a forward
frameshift is a bit like a one-nucleotide intron. Regardless, I need two
symbols, F and B maybe? Just nod if this works for you.
> G - gap, implicitly protein coding
>
> H - gap, in "other" dna sequence
These are protein indel parallels to F and B, right? i.e.:
ATGGCT---GTATGT
CCCCCCGGGCCCHHH
would this then include the last codon TGT in the CDS? If so, how does
HHH differ from CCC, in terms of influencing the encoded protein sequence?
If it's only in terms of expressing the presence of a gap in another
sequence, then I think that's better handled in that other sequence
(together in a SimpleAlign or similar object). This is still a bit fuzzy
for me ...
> U - UTR (or U - 5'UTR, V - 3'UTR)
Ditto as above for CDE and IJK; U and V will be used by us, but the user
can just say U (and if we know strand, we can work out what's what).
OK, then here is my proposed functionality; please read carefully as this
is exactly what you'll get, unless someone raises a little noise soon:
package Bio::EncodedSeq;
use strict;
use Bio::LocatableSeq;
@ISA = qw(Bio::LocatableSeq);
=head2 new
Title : new
Usage : $obj = Bio::EncodedSeq->new(-dnaseq => "AGTACGTGTCATG",
-encoding => "CCCCCCFCCCCCC",
-id => "myseq",
-start => 1,
-end => 13,
-strand => 1
);
Function: creates a new Bio::EncodedSeq object from a supplied DNA
sequence
Returns : a new Bio::EncodedSeq object
Args : dnaseq - primary nucleotide sequence used to encode the
protein
encoding - a string of characters (see Encoding Table)
describing backwards frameshifts implied by the
encoding but not present in the sequence will be
added (as '-'s) to the sequence. If not
supplied, it will be assumed that all positions
are coding (C). Encoding may include either
implicit phase encoding characters (i.e. "CCC")
and/or explicit encoding characters (i.e. "CDE").
Alternatively, encoding may be a hashref
datastructure, with encoding characters as keys
and Bio::LocationI objects (or arrayrefs of
Bio::LocationI objects) as values, e.g.:
{ C => [ Bio::Location::Simple->new(1,9),
Bio::Location::Simple->new(11,13) ],
F => Bio::Location::Simple->new(10,10),
} # same as "CCCCCCCCCFCCC"
id, start, end, strand - as with Bio::LocatableSeq; note
that the coordinates are relative to the
encoding DNA sequence, not the implicit protein
sequence.
=cut
=head2 encoding
Title : encoding
Usage : $obj->encoding("CCCCCC");
$obj->encoding( -encoding => { I => $location } );
$enc = $obj->encoding(-explicit => 1);
$enc = $obj->encoding("CCCCCC", -explicit => 1);
$enc = $obj->encoding(-location => $location,
-explicit => 1 );
Function: get/set the objects encoding, either globally or by location(s).
Returns : the (possibly new) encoding string.
Args : encoding - see the encoding argument to the new() function.
explicit - whether or not to return explicit phase
information in the coding (i.e. "CCC" becomes
"CDE", "III" becomes "IJK", etc); defaults to 0.
location - optional; location to get/set the encoding.
Defaults to the entire sequence.
=cut
=head2 cds
Title : cds
Usage : $cds = $obj->cds();
Function: obtain the "spliced" DNA sequence, by removing any
nucleotides that participate in an UTR, forward frameshift
or intron, and replacing any unknown nucleotide implied by
a backward frameshift or gap with N's.
Returns : a Bio::EncodedSeq object, with an encoding consisting only
of "CCCC..".
Args : none.
=cut
=head2 translate
Title : translate
Usage : $prot = $obj->translate(@args);
Function: obtain the protein sequence encoded by the underlying DNA
sequence; same as $obj->cds()->translate(@args).
Returns : a Bio::PrimarySeq object.
Args : same as the translate() function of Bio::PrimarySeqI
=cut
=head2 seq
Title : seq
Usage : $protseq = $obj->seq();
Function: obtain the raw protein sequence encoded by the underlying
DNA sequence; This is the same as calling
$obj->translate()->seq();
Returns : a string of single-letter amino acid codes
Args : same as the seq() function of Bio::PrimarySeq; note that this
function may not be used to set the protein sequence; see
the dnaseq() function for that.
=cut
=head2 dnaseq
Title : dnaseq
Usage : $dnaseq = $obj->dnaseq();
$obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
$obj->dnaseq(-dnaseq => "ATG",
-encoding => "CCC",
-location => $loc );
Function: get/set the underlying DNA sequence; will overwrite any
current DNA and/or encoding information present.
Returns : a string of single-letter nucleotide codes, including any
gaps implied by the encoding.
Args : dnaseq - the DNA sequence to be used as a replacement
encoding - the encoding of the DNA sequence (see the new()
constructor); defaults to all 'C'.
location - optional, the location of the DNA sequence to
get/set; defaults to the entire sequence.
=cut
[ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
methods; note that the coordinates of those methods will refer only to
the underlying DNA sequence, not the implicit encoded protein sequence
- my next task will be to extend Ewan and Heikki's Bio::Coordinate
system to include Bio::Coordinate::EncodedPair so that conversions can
be made more easily ... any comments on that? ]
thanks for reading,
-Aaron
--
Aaron J Mackey
Pearson Laboratory
University of Virginia
(434) 924-2821
amackey@virginia.edu