[Bioperl-l] Translating codons

Mon, 25 Jun 2001 11:47:07 -0400

> Thanks for input. I'll see if we can improve the translate using your
> suggestions. I have my reservations though. Translations are quite
> complex beasts.

Thanks for taking my input seriously. I'm very new to bioinformatics, so I'm
sure some of my thoughts have been thought of and discarded a long time ago.

[snip difficulties in CodonTable->translate]
> 
> It would help a lot if you could have a look at
> Bio::Tools::CodonTable::translate and see if all that could be put
> into your %amino hash. POD docs and comments explain what is required.
> There are quite extensive tests on required functionality in
> t/CodonTable.t and t/Seq.t.
> 
> If I am missing the point here, please tell me.

Well, I definitely didn't look closely enough at CodonTable->translate to
realize how complicated things are. However, my main point doesn't actually
depend on how complicated this stuff is. I think.

All I was trying to say is that you get overhead just from calling a
subroutine. So why not move the for() loop from PrimarySeqI->translate into
CodonTable->translate? How about this:

sub translate_long {
    my ($self, $seq) = @_;
    my $id = $self->id;
    my $l = length $seq;
    $self->throw("Need a sequence of length 3n!") if $l % 3;
    $seq = lc $seq;
    $seq =~ tr/u/t/;
    my $protein = "";
    if ($seq !~ /[^acgt]/) {
        # No ambiguous codons!
        for (my $i = 0; $i < length($seq); $i += 3) {
            my $triplet = substr($seq, $i, 3);
            if (exists $codons->{$triplet}) {
                $protein .= substr($tables[$id-1], $codons->{$triplet}, 1);
            } else {
                $protein .= 'X';
            }
        }
    } else {
        # At least one ambiguous base: take the slower route
        for (my $i = 0; $i < length($seq); $i += 3) {
            my $triplet = substr($seq, $i, 3);
            my $aa;
            my @codons = _unambiquos_codons($triplet);
            # ...
            # More code from CodonTable->translate, only set $aa instead of
            # returning things, and then
            $protein .= $aa;
        }
    }
    return $protein;
}

The calling subroutine could worry about what to do if the sequence isn't of
length 3n, etc. It seems to me like this could be faster than calling
translate_strict/translate many times. Of course, maybe I should shut up
until I get some more bioinformatics experience, but isn't it true that you
very often want to translate relatively long sequences?
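
To put a number on the call overhead, here is a rough, self-contained sketch
using the core Benchmark module. It does not touch Bio::Tools::CodonTable at
all: %aa_for, translate_codon, translate_per_codon and translate_whole are
hypothetical stand-ins, and the toy table only knows the standard code with
no ambiguity handling. So take it as an illustration of the idea, not a
measurement of BioPerl itself.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Toy standard-code table (hypothetical stand-in for the real codon tables).
my @bases = qw(t c a g);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %aa_for;
my $n = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $aa_for{"$b1$b2$b3"} = substr($aas, $n++, 1);
        }
    }
}

# Translate a single codon; unknown codons become 'X'.
sub translate_codon {
    my ($codon) = @_;
    return exists $aa_for{$codon} ? $aa_for{$codon} : 'X';
}

# Caller loops over the sequence: one subroutine call per codon.
sub translate_per_codon {
    my ($seq) = @_;
    my $protein = '';
    for (my $i = 0; $i < length($seq); $i += 3) {
        $protein .= translate_codon(substr($seq, $i, 3));
    }
    return $protein;
}

# Loop lives inside the translating subroutine: one call per sequence.
sub translate_whole {
    my ($seq) = @_;
    my $protein = '';
    for (my $i = 0; $i < length($seq); $i += 3) {
        my $codon = substr($seq, $i, 3);
        $protein .= exists $aa_for{$codon} ? $aa_for{$codon} : 'X';
    }
    return $protein;
}

# 3000 codons per run, 500 runs of each variant.
my $seq = 'atggcctaa' x 1000;
cmpthese(500, {
    per_codon => sub { translate_per_codon($seq) },
    whole     => sub { translate_whole($seq) },
});
```

Both subroutines return the same protein string; the only difference is
where the loop lives, which is all I'm arguing about.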

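If you implemented the above, you would have to change PrimarySeqI->translate
a bit more (and I figured, why not do the stop/unknown substitutions once on
the whole output with s///?). In the diff below, the "<" lines are my version
and the ">" lines are the current code: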

552,561c552,564
<   # Get a sequence of length 3n
<   my $l = length $seq;
<   my $m = $l - ($l % 3);
<   my $subseq = substr($seq, 0, $m);
<   # Translate it
<   my $output = $codonTable->translate($subseq);
<   # Use user-input stop/unknown
<   $output =~ s/\*/$stop/g;
<   $output =~ s/X/$unknown/g;
< 
---
>   for ($i = 0 ; $i < $length ; $i += 3)  {
>       my $codon = substr($seq, $i, 3);
>       my $aa = $codonTable->translate($codon);
>       if ($aa eq '*') {
>          $output .= $stop;
>       }
>       elsif ($aa eq 'X') {
>          $output .= $unknown;
>       }
>       else {
>         $output .= $aa ;
>       }
>   }

I haven't written explicitly what to do if the sequence isn't length 3n,
since I don't know what the right molecular bio thing to do is, but I assume
there's something.
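
For what it's worth, here is a small standalone sketch of the two
bookkeeping steps from the diff above: dropping a trailing partial codon,
and swapping in the user's stop/unknown symbols. The helper names
trim_to_codons and apply_symbols are made up for illustration, and silently
truncating is only one possible policy; I'm not claiming it is the
biologically right one. The substitution is done in a single s///, so that a
user-supplied stop symbol of 'X' doesn't get rewritten again as "unknown".

```perl
use strict;
use warnings;

# Hypothetical helper: drop any trailing partial codon so the length is 3n.
sub trim_to_codons {
    my ($seq) = @_;
    my $l = length $seq;
    return substr($seq, 0, $l - ($l % 3));
}

# Hypothetical helper: replace '*' and 'X' with user-chosen symbols in one
# pass over the translated string.
sub apply_symbols {
    my ($protein, $stop, $unknown) = @_;
    $protein =~ s/([*X])/$1 eq '*' ? $stop : $unknown/ge;
    return $protein;
}

print trim_to_codons('atggc'), "\n";           # two leftover bases dropped
print apply_symbols('MA*X', '.', '?'), "\n";   # stop -> '.', unknown -> '?'
```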

Anyway, would this be at all useful?

Amir Karger
CuraGen Corporation