[Bioperl-l] Translating codons

Heikki Lehvaslaiho heikki@ebi.ac.uk
Mon, 25 Jun 2001 10:20:31 +0100


Amir,

Thanks for the input. I'll see if we can improve translate using your
suggestions. I have my reservations, though: translations are quite
complex beasts.

1. CodonTable is an object, so it introduces some overhead. 
   We have to have descriptions for the 17 existing codon 
   tables.

2. We have to be compatible with EMBL/GenBank translations,
   which complicates matters considerably.
   Their rules require, e.g., that an unambiguous two-character
   codon returns an amino acid. Also, translate has to
   function differently when the fullCDS flag is set.

3. We have to deal with ambiguous nucleotides. Unambiguous 
   codon translation calls are optimized.
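
To make points 2 and 3 concrete, here is a minimal sketch of what a
single-codon lookup has to do beyond a plain hash access. It is not the
actual Bio::Tools::CodonTable code. The names translate_codon and %iupac
are made up for illustration, the contents of %amino are an assumption
(the standard genetic code only), and returning 'X' for an undecidable
codon and '' for input of the wrong length are assumptions as well.

```perl
use strict;
use warnings;

# Standard genetic code only (an assumption; the other tables are ignored).
# Amino acids are listed for codons in T, C, A, G order at each position.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %amino;
my $n = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $amino{"$b1$b2$b3"} = substr($aas, $n++, 1);
        }
    }
}

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
my %iupac = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',  B => 'CGT', D => 'AGT',
    H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: translate one codon of two or three characters.
sub translate_codon {
    my ($codon) = @_;
    $codon = uc $codon;
    $codon =~ tr/U/T/;

    # Fast path: an unambiguous three-letter codon is a single lookup.
    return $amino{$codon} if exists $amino{$codon};

    # Assumption: anything that is not 2 or 3 characters yields ''.
    return '' if length($codon) < 2 || length($codon) > 3;

    # A two-character codon is treated as if the third base were N.
    $codon .= 'N' if length($codon) == 2;
    return 'X' if $codon =~ /[^ACGTRYSWKMBDHVN]/;

    # Expand every ambiguity code; the codon is translatable only
    # if all of the expansions agree on one amino acid.
    my %seen;
    for my $b1 (split //, $iupac{ substr($codon, 0, 1) }) {
        for my $b2 (split //, $iupac{ substr($codon, 1, 1) }) {
            for my $b3 (split //, $iupac{ substr($codon, 2, 1) }) {
                $seen{ $amino{"$b1$b2$b3"} } = 1;
            }
        }
    }
    my @aa = keys %seen;
    return @aa == 1 ? $aa[0] : 'X';
}
```

With this sketch, 'CT' gives 'L' because all four CTN codons code for
leucine, while 'AT' gives 'X' because ATG differs from the other three.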

It would help a lot if you could have a look at
Bio::Tools::CodonTable::translate and see if all that could be put
into your %amino hash. The POD docs and comments explain what is
required. There are quite extensive tests of the required functionality
in t/CodonTable.t and t/Seq.t.
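
For reference, the simple case you describe, with the loop moved inside
the subroutine, could look roughly like the sketch below. It covers only
unambiguous codons in the standard genetic code. The name translate_seq
is made up, the way %amino is filled is an assumption (your get_codons
was not shown), and mapping unknown codons to 'X' and dropping a trailing
partial codon are assumptions too, not the EMBL/GenBank behaviour.

```perl
use strict;
use warnings;

# Standard genetic code only (an assumption); amino acids are listed
# for codons in T, C, A, G order at each position.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %amino;
my $n = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $amino{"$b1$b2$b3"} = substr($aas, $n++, 1);
        }
    }
}

# Hypothetical helper: translate a whole sequence in one call,
# so there is one sub call per sequence instead of one per codon.
sub translate_seq {
    my ($seq) = @_;
    $seq = uc $seq;
    $seq =~ tr/U/T/;

    # Only complete codons are translated; a trailing one- or
    # two-base remainder is ignored in this sketch.
    my $end     = length($seq) - length($seq) % 3;
    my $protein = '';
    for (my $i = 0; $i < $end; $i += 3) {
        my $aa = $amino{ substr($seq, $i, 3) };
        $protein .= defined $aa ? $aa : 'X';   # unknown codon: 'X'
    }
    return $protein;
}
```

A single codon still works as input, e.g. translate_seq('ATG') returns
'M'. Everything in points 1 to 3 would still have to be layered on top
of this.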

If I am missing the point here, please tell me.

Yours,
	-Heikki

"Karger, Amir" wrote:
> 
> Am I correct in thinking that the default PrimarySeqI::translate method is
> pretty slow? It calls translate on each three-letter codon. Why not have
> translate take any sequence with length 3n, returning a string of length n?
> Just move the for loop inside the subroutine. It seems like it would still
> work if you happen to put in a single codon, but this way would work faster
> for sequences of, say, thousands of bases.
> 
> For example, here's code that translates a DNA sequence into protein.
> ---------------------------
> use Benchmark;
> 
> my $seq = "actgactgactgactggtgcactacgacta" x 1000;
> my $len = length($seq);
> 
> %amino = &get_codons;
> 
> timethese(50, {
>     "substr" => \&do_substr,
>     "match" =>  \&do_match,
>     "pack" => \&do_pack,
> }, "dividing large string with subroutine" );
> 
> sub translate {
>     my $in = shift;
>     $out = $amino{$in};
>     return $out;
> }
> 
> sub do_substr {
>     my $protein = "";
>     for ($i = 0 ; $i < $len ; $i += 3)  {
>         my $codon = substr($seq, $i, 3);
>         $protein .= &translate($codon);
>     }
>     return $protein;
> }
> 
> [stuff that's the same as do_substr snipped]
> sub do_match {
>     my @triplet = ($seq =~ /(...)/g);
> }
> 
> sub do_pack {
>     my @triplet = unpack("A3" x ($len/3), $seq);
> }
> ------------------------
> 
> (Out of curiosity, I tried three methods of splitting the string.
> Surprisingly, the difference between them seems to be only about 5%. But...)
> As you can see, there's a 100% or so speedup when I change the code to
> just do the $amino{$codon} lookup inside the do_* subs, rather than
> calling &translate.
> 
> Benchmark: timing 50 iterations of match, pack, substr without sub call.
>      match: 8 8.04 0.03 0 0 dividing large string
>       pack: 7 7.27 0 0 0
>     substr: 9 8.22 0 0 0
> 
> Benchmark: timing 50 iterations of match, pack, substr with n sub calls
>      match: 19 17.82 0.01 0 0 dividing large string with subroutine
>       pack: 19 16.96 0.01 0 0 dividing large string with subroutine
>     substr: 19 18.96 0 0 0 dividing large string with subroutine
> 
> As far as I can tell, this would be a pretty easy change and wouldn't break
> anything. (Famous last words.)
> 
> Amir Karger
> Curagen Corporation

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________