[Bioperl-l] Translating codons : re-coded

Heikki Lehvaslaiho heikki@ebi.ac.uk
Tue, 26 Jun 2001 13:01:15 +0100


Dear Amir,

When I started writing CodonTable.pm a bit more than a year ago, I was
overwhelmed by the details of the translation process and followed the
wise advice to keep it simple: let sequence objects handle the sequence
level and keep CodonTable focused on the codon level. Translation was
really slow at first. Kim Rutherford suggested separating the unambiguous
case into a subroutine, which sped things up a bit.
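
A minimal sketch of that idea, not the actual CodonTable.pm code: codons
made only of a, c, g and t are answered by a single hash lookup, and only
codons with IUPAC ambiguity codes take the slower route of expanding every
possibility. The subroutine names, and the choice of 'X' for a codon that
cannot be resolved to one residue, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Standard genetic code (NCBI translation table 1), bases ordered t, c, a, g.
my @BASES = qw(t c a g);
my $AAS   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %CODON;
my $n = 0;
for my $x (@BASES) {
    for my $y (@BASES) {
        for my $z (@BASES) {
            $CODON{"$x$y$z"} = substr $AAS, $n++, 1;
        }
    }
}

# IUPAC nucleotide codes and the bases each one stands for.
my %IUPAC = (
    a => 'a',   c => 'c',   g => 'g',   t => 't',   u => 't',
    r => 'ag',  y => 'ct',  m => 'ac',  k => 'gt',  s => 'cg', w => 'at',
    b => 'cgt', d => 'agt', h => 'act', v => 'acg', n => 'acgt',
);

# Fast path: an unambiguous codon is one hash lookup.
sub translate_codon {
    my ($codon) = @_;
    $codon = lc $codon;
    return $CODON{$codon} if exists $CODON{$codon};
    return translate_ambiguous_codon($codon);
}

# Slow path: expand the ambiguity codes and return the residue only if
# every possible codon agrees on it; otherwise return 'X'.
sub translate_ambiguous_codon {
    my ($codon) = @_;
    return 'X' unless length($codon) == 3;
    my @choices;
    for my $symbol (split //, $codon) {
        return 'X' unless exists $IUPAC{$symbol};
        push @choices, [ split //, $IUPAC{$symbol} ];
    }
    my %residues;
    for my $x (@{ $choices[0] }) {
        for my $y (@{ $choices[1] }) {
            for my $z (@{ $choices[2] }) {
                $residues{ $CODON{"$x$y$z"} } = 1;
            }
        }
    }
    my @unique = keys %residues;
    return @unique == 1 ? $unique[0] : 'X';
}

print join(' ', map { translate_codon($_) } qw(atg gcn tar ntg)), "\n";
```

Most codons in real sequence data are unambiguous, so keeping them off the
expansion path is where the time is saved.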

Without your efforts and insight I would not have bothered to look at
it again. "Do not touch what is not broken" is usually a good rule.

Yesterday I started to code in anger. I was determined to rewrite
translation according to your suggestions but get all the ambiguities
in, so that all the bioperl tests on the translation process pass.

I fixed the last bugs this morning and was able to compare execution
times. The results depend heavily on sequence length, and because of
the way I've structured the code, it runs faster if the sequence
length is a multiple of 3.

The speedup is 3-6x, which is pretty good. PrimarySeqI::translate
now handles _only_ protein-level issues. CodonTable::translate can
take in sequences of any length.
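
As a rough illustration of what accepting any length means, here is a
small sketch that walks a nucleotide string three bases at a time. It is
not the committed code: translate_any_length is a hypothetical name, and
silently dropping one or two trailing bases that do not form a full codon
is an assumption of this sketch, as is writing 'X' for any codon that is
not a plain a/c/g/t triplet.

```perl
use strict;
use warnings;

# Standard genetic code (NCBI translation table 1), bases ordered t, c, a, g.
my @BASES = qw(t c a g);
my $AAS   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %CODON;
my $n = 0;
for my $x (@BASES) {
    for my $y (@BASES) {
        for my $z (@BASES) {
            $CODON{"$x$y$z"} = substr $AAS, $n++, 1;
        }
    }
}

# Hypothetical helper: translate a nucleotide string of any length.
sub translate_any_length {
    my ($seq) = @_;
    $seq = lc $seq;
    $seq =~ tr/u/t/;                    # treat RNA input like DNA
    my $protein = '';
    my $last    = length($seq) - 3;     # start of the last complete codon
    for (my $pos = 0; $pos <= $last; $pos += 3) {
        my $codon = substr $seq, $pos, 3;
        # Codons missing from the table (ambiguity codes, gaps) become 'X'.
        $protein .= exists $CODON{$codon} ? $CODON{$codon} : 'X';
    }
    return $protein;                    # incomplete trailing codon is dropped
}

# The 30-nt unit used in the benchmark, with and without an extra base.
print translate_any_length('actgactgactgactggtgcactacgacta'),  "\n";
print translate_any_length('actgactgactgactggtgcactacgactag'), "\n";
```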

For the time being, I've left the old code in as
PrimarySeqI::translate_old and CodonTable::translate_old, in case
there are some problems. The code is committed to the head of
bioperl-live.

Thanks again for your interest and help,

	-Heikki

P.S. Here are the gory details of benchmarking.


The program (translate_test.pl) I used:
-----------------------------------------------------------
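#!/usr/local/bin/perl
use strict;
use warnings;
use Benchmark;
use Bio::PrimarySeq;

# Number of times the 30-nt test string is repeated (default 100).
my $multiples = shift || 100;
my $seq_string = "actgactgactgactggtgcactacgacta" x $multiples;
# Uncomment to make the sequence length a multiple of 3, plus 1.
#$seq_string .= 'g';
my $seq = Bio::PrimarySeq->new(-seq     => $seq_string,
                               -id      => 'no.One',
                               -moltype => 'dna'
                               );

print "seq length = ", $seq->length, "\n";
print "-" x 30, "\n";

# Run each translation method 100 times and print a comparison table.
Benchmark::cmpthese(100, {
    "new" => sub { $seq->translate },
    "old" => sub { $seq->translate_old }
});
print "-" x 30, "\n";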
-----------------------------------------------------------

The benchmarks were run on a puny PII 266 MHz laptop:

Sequence lengths are multiples of 3.

odo > translate_test.pl 10
seq length = 300
------------------------------
Benchmark: timing 100 iterations of new, old...
       new:  0 wallclock secs ( 0.54 usr +  0.01 sys =  0.55 CPU) @ 181.82/s (n=100)
            (warning: too few iterations for a reliable count)
       old:  3 wallclock secs ( 2.01 usr +  0.01 sys =  2.02 CPU) @ 49.50/s (n=100)
      Rate  old  new
old 49.5/s   -- -73%
new  182/s 267%   --
------------------------------
odo > translate_test.pl 100
seq length = 3000
------------------------------
Benchmark: timing 100 iterations of new, old...
       new:  3 wallclock secs ( 2.83 usr +  0.00 sys =  2.83 CPU) @ 35.34/s (n=100)
       old: 18 wallclock secs (17.30 usr +  0.04 sys = 17.34 CPU) @  5.77/s (n=100)
      Rate  old  new
old 5.77/s   -- -84%
new 35.3/s 513%   --
------------------------------
odo > translate_test.pl 1000
seq length = 30000
------------------------------
Benchmark: timing 100 iterations of new, old...
       new: 78 wallclock secs (75.84 usr +  0.04 sys = 75.88 CPU) @  1.32/s (n=100)
       old: 511 wallclock secs (501.65 usr +  0.16 sys = 501.81 CPU) @  0.20/s (n=100)
    s/iter  old  new
old   5.02   -- -85%
new  0.759 561%   --
------------------------------


Same series again with one nucleotide added to the sequence (the
commented-out $seq_string .= 'g' line in the script enabled).
Sequence lengths are multiples of 3, plus 1.

odo > translate_test.pl 10
seq length = 301
------------------------------
Benchmark: timing 100 iterations of new, old...
       new:  2 wallclock secs ( 1.49 usr +  0.00 sys =  1.49 CPU) @ 67.11/s (n=100)
       old:  5 wallclock secs ( 4.92 usr +  0.00 sys =  4.92 CPU) @ 20.33/s (n=100)
      Rate  old  new
old 20.3/s   -- -70%
new 67.1/s 230%   --
------------------------------
odo > translate_test.pl 100
seq length = 3001
------------------------------
Benchmark: timing 100 iterations of new, old...
       new:  9 wallclock secs ( 9.23 usr +  0.02 sys =  9.25 CPU) @ 10.81/s (n=100)
       old: 50 wallclock secs (49.95 usr +  0.01 sys = 49.96 CPU) @  2.00/s (n=100)
      Rate  old  new
old 2.00/s   -- -81%
new 10.8/s 440%   --
------------------------------
odo > translate_test.pl 1000
seq length = 30001
------------------------------
Benchmark: timing 100 iterations of new, old...
       new: 93 wallclock secs (91.03 usr +  0.07 sys = 91.10 CPU) @  1.10/s (n=100)
       old: 524 wallclock secs (513.89 usr +  0.22 sys = 514.11 CPU) @  0.19/s (n=100)
    s/iter  old  new
old   5.14   -- -82%
new  0.911 464%   --
------------------------------
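
For anyone who wants to check the arithmetic, the percentage in each
comparison table is how much faster "new" is than "old", so the speedup
factor is the ratio of the CPU times (equivalently, 1 plus the percentage
divided by 100). The short script below recomputes both from the CPU
totals printed above; the speedup subroutine is just a name chosen for
this sketch.

```perl
use strict;
use warnings;

# CPU seconds for 100 iterations, copied from the benchmark output:
# [ sequence length, new, old ]
my @runs = (
    [   300,  0.55,   2.02 ],
    [  3000,  2.83,  17.34 ],
    [ 30000, 75.88, 501.81 ],
    [   301,  1.49,   4.92 ],
    [  3001,  9.25,  49.96 ],
    [ 30001, 91.10, 514.11 ],
);

# How many times faster the new code is than the old code.
sub speedup {
    my ($new_cpu, $old_cpu) = @_;
    return $old_cpu / $new_cpu;
}

for my $run (@runs) {
    my ($length, $new_cpu, $old_cpu) = @$run;
    my $factor = speedup($new_cpu, $old_cpu);
    printf "%6d nt: %.1fx faster (%.0f%% in the comparison table)\n",
        $length, $factor, ($factor - 1) * 100;
}
```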


-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________