[Bioperl-l] Re: Frameshifts in alignments ... ?
Ewan Birney
birney@ebi.ac.uk
Thu, 5 Sep 2002 08:05:34 +0100 (BST)
On Wed, 4 Sep 2002, Aaron J Mackey wrote:
>
> package Bio::EncodedSeq;
I think we should go for Bio::Seq::EncodedSeq
>
> use strict;
> use Bio::LocatableSeq;
>
> @ISA = qw(Bio::LocatableSeq);
>
> =head2 new
> Title : new
> Usage : $obj = Bio::EncodedSeq->new(-dnaseq => "AGTACGTGTCATG",
> -encoding => "CCCCCCFCCCCCC",
> -id => "myseq",
> -start => 1,
> -end => 13,
> -strand => 1
> );
> Function: creates a new Bio::EncodedSeq object from a supplied DNA
> sequence
> Returns : a new Bio::EncodedSeq object
> Args : dnaseq - primary nucleotide sequence used to encode the
> protein
> encoding - a string of characters (see Encoding Table)
> describing the encoding of each position; backwards
> frameshifts implied by the encoding but not present in
> the sequence will be added (as '-'s) to the sequence.
> If not supplied, it will be assumed that all positions
> are coding (C). Encoding may include implicit phase
> encoding characters (e.g. "CCC") and/or explicit
> encoding characters (e.g. "CDE").
> Alternatively, encoding may be a hashref
> datastructure, with encoding characters as keys
> and Bio::LocationI objects (or arrayrefs of
> Bio::LocationI objects) as values, e.g.:
> { C => [ Bio::Location::Simple->new(1,9),
> Bio::Location::Simple->new(11,13) ],
> F => Bio::Location::Simple->new(10,10),
> } # same as "CCCCCCCCCFCCC"
> id, start, end, strand - as with Bio::LocatableSeq; note
> that the coordinates are relative to the
> encoding DNA sequence, not the implicit protein
> sequence.
> =cut
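To check I read the hashref form correctly, here is a rough sketch of how it might expand into the per-base string. This is only an illustration: expand_encoding() is a made-up helper, and plain [start, end] pairs (1-based, inclusive) stand in for Bio::LocationI objects so the snippet runs without BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the proposal: expand
# { char => [ [start, end], ... ] } into a per-base encoding string.
# [start, end] pairs are stand-ins for Bio::LocationI objects.
sub expand_encoding {
    my ($map, $length) = @_;
    my @enc = ('C') x $length;    # unspecified positions default to coding
    for my $char (sort keys %$map) {
        for my $range (@{ $map->{$char} }) {
            my ($start, $end) = @$range;
            $enc[$_ - 1] = $char for $start .. $end;
        }
    }
    return join '', @enc;
}

my $enc = expand_encoding(
    { C => [ [1, 9], [11, 13] ], F => [ [10, 10] ] },
    13,
);
print "$enc\n";    # CCCCCCCCCFCCC
```

If two ranges overlap, the later key in sort order wins here; the proposal does not say what should happen in that case.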
>
> =head2 encoding
> Title : encoding
> Usage : $obj->encoding("CCCCCC");
> $obj->encoding( -encoding => { I => $location } );
> $enc = $obj->encoding(-explicit => 1);
> $enc = $obj->encoding("CCCCCC", -explicit => 1);
> $enc = $obj->encoding(-location => $location,
> -explicit => 1 );
> Function: get/set the object's encoding, either globally or by location(s).
> Returns : the (possibly new) encoding string.
> Args : encoding - see the encoding argument to the new() function.
> explicit - whether or not to return explicit phase
> information in the coding (e.g. "CCC" becomes
> "CDE", "III" becomes "IJK", etc); defaults to 0.
> location - optional; location to get/set the encoding.
> Defaults to the entire sequence.
> =cut
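A minimal sketch of the implicit-to-explicit conversion, as I understand it from the "CCC" becomes "CDE" example. The assumptions are mine: explicit_encoding() is a hypothetical name, the phase restarts at the beginning of every run of identical characters, and anything other than C and I passes through untouched. How phase should carry across a frameshift or an intron is not spelled out above, so treat this as one possible reading.

```perl
use strict;
use warnings;

# Implicit character => its explicit phase characters (from the examples
# "CCC" -> "CDE" and "III" -> "IJK").
my %cycle = ( C => [qw(C D E)], I => [qw(I J K)] );

# Hypothetical helper: rewrite each maximal run of an implicit character
# with cycling explicit phase characters. ASSUMPTION: phase restarts at
# the start of every run.
sub explicit_encoding {
    my ($implicit) = @_;
    my $out = '';
    while ($implicit =~ /((.)\2*)/g) {
        my ($run, $char) = ($1, $2);
        if (my $phases = $cycle{$char}) {
            $out .= join '', map { $phases->[$_ % 3] } 0 .. length($run) - 1;
        }
        else {
            $out .= $run;    # characters without phase are copied as-is
        }
    }
    return $out;
}

print explicit_encoding("CCCCCCFIII"), "\n";    # CDECDEFIJK
```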
>
> =head2 cds
> Title : cds
> Usage : $cds = $obj->cds();
> Function: obtain the "spliced" DNA sequence, by removing any
> nucleotides that participate in a UTR, forward frameshift
> or intron, and replacing any unknown nucleotide implied by
> a backward frameshift or gap with N's.
> Returns : a Bio::EncodedSeq object, with an encoding consisting only
> of "CCCC..".
> Args : none.
> =cut
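The splicing rule could be sketched roughly like this. Only C, D, E, F, I, J and K appear in this mail; the codes U (UTR), B (backward frameshift) and G (gap) are my guesses at what the Encoding Table holds, and spliced_cds() is a hypothetical stand-alone function working on plain strings rather than the proposed method.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the cds() rule:
#   C/D/E  coding               -> keep the base
#   B, G   backward shift, gap  -> emit 'N'   (ASSUMED codes)
#   other  UTR, forward shift,  -> drop       (U is an ASSUMED code)
#          intron
sub spliced_cds {
    my ($dna, $enc) = @_;
    die "sequence and encoding differ in length\n"
        unless length($dna) == length($enc);
    my $cds = '';
    for my $i (0 .. length($dna) - 1) {
        my $code = substr $enc, $i, 1;
        if    ($code =~ /[CDE]/) { $cds .= substr $dna, $i, 1 }
        elsif ($code =~ /[BG]/)  { $cds .= 'N' }
    }
    return $cds;
}

# The example from new(): the base under 'F' (position 7) is removed.
print spliced_cds("AGTACGTGTCATG", "CCCCCCFCCCCCC"), "\n";    # AGTACGGTCATG
```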
>
> =head2 translate
> Title : translate
> Usage : $prot = $obj->translate(@args);
> Function: obtain the protein sequence encoded by the underlying DNA
> sequence; same as $obj->cds()->translate(@args).
> Returns : a Bio::PrimarySeq object.
> Args : same as the translate() function of Bio::PrimarySeqI
> =cut
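For illustration, translating an already-spliced CDS string can be sketched without BioPerl as below. translate_cds() is a hypothetical helper, it uses the standard genetic code only, and it does not attempt the options that Bio::PrimarySeqI's translate() accepts. Reporting 'X' for codons containing N and dropping a trailing partial codon are my choices, not something the proposal states.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon;
my $n = 0;
for my $x (@bases) {
    for my $y (@bases) {
        for my $z (@bases) {
            $codon{"$x$y$z"} = substr $aas, $n++, 1;
        }
    }
}

# Hypothetical helper: translate a spliced CDS string in frame 0.
# Codons not in the table (e.g. containing N) become 'X'; a trailing
# partial codon is ignored.
sub translate_cds {
    my ($cds) = @_;
    my $prot = '';
    for (my $pos = 0; $pos + 3 <= length($cds); $pos += 3) {
        my $triplet = uc(substr($cds, $pos, 3));
        $prot .= exists $codon{$triplet} ? $codon{$triplet} : 'X';
    }
    return $prot;
}

# The spliced form of the new() example (base under 'F' removed).
print translate_cds("AGTACGGTCATG"), "\n";    # STVM
```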
>
> =head2 seq
> Title : seq
> Usage : $protseq = $obj->seq();
> Function: obtain the raw protein sequence encoded by the underlying
> DNA sequence; this is the same as calling
> $obj->translate()->seq();
> Returns : a string of single-letter amino acid codes
> Args : same as the seq() function of Bio::PrimarySeq; note that this
> function may not be used to set the protein sequence; see
> the dnaseq() function for that.
> =cut
>
> =head2 dnaseq
> Title : dnaseq
> Usage : $dnaseq = $obj->dnaseq();
> $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
> $obj->dnaseq(-dnaseq => "ATG",
> -encoding => "CCC",
> -location => $loc );
> Function: get/set the underlying DNA sequence; will overwrite any
> current DNA and/or encoding information present.
> Returns : a string of single-letter nucleotide codes, including any
> gaps implied by the encoding.
> Args : dnaseq - the DNA sequence to be used as a replacement
> encoding - the encoding of the DNA sequence (see the new()
> constructor); defaults to all 'C'.
> location - optional, the location of the DNA sequence to
> get/set; defaults to the entire sequence.
> =cut
>
> [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> methods; note that the coordinates of those methods will refer only to
> the underlying DNA sequence, not the implicit encoded protein sequence
> - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> system to include Bio::Coordinate::EncodedPair so that conversions can
> be made more easily ... any comments on that? ]
You are a brave man. Look forward to seeing this in...
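To make the conversion problem concrete, here is a rough sketch of mapping a DNA coordinate to a residue number through the encoding string. Everything in it is an assumption for illustration: dna_to_protein_pos() is a made-up function, not a Bio::Coordinate interface, and B and G are guessed codes for positions that cds() would fill with N.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based DNA position to a 1-based protein
# residue, or return nothing if the position does not contribute to the
# CDS. ASSUMPTION: C/D/E are coding, and B/G positions count as coding
# because cds() replaces them with N.
sub dna_to_protein_pos {
    my ($enc, $dna_pos) = @_;
    return if $dna_pos < 1 || $dna_pos > length($enc);
    return unless substr($enc, $dna_pos - 1, 1) =~ /[CDEBG]/;

    # Count CDS positions up to and including this one.
    my $prefix = substr($enc, 0, $dna_pos);
    my $coding = ($prefix =~ tr/CDEBG//);

    return int(($coding + 2) / 3);    # ceiling of $coding / 3
}

my $enc = "CCCCCCFCCCCCC";
for my $pos (1, 7, 8, 13) {
    my $res = dna_to_protein_pos($enc, $pos);
    printf "DNA %2d -> %s\n", $pos, defined $res ? "residue $res" : "not in CDS";
}
```

With the encoding from the new() example, position 7 (the 'F') maps to nothing, and position 8 lands in residue 3 because it is the seventh coding base.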
>
> thanks for reading,
>
> -Aaron
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey@virginia.edu
>
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------