[BioPython] Translating ambiguous stop codons

Fri Mar 10 14:07:56 UTC 2006

I've been working on simple gene finding within sequence contigs from 
unfinished genomes.  I have used Biopython to scan each of the six 
frames looking for a start codon, translating until the next stop 
codon, and repeating.  This is a simple way of generating a list of 
possible open reading frames for further analysis.

Unfortunately (as is probably the case for many unfinished genomes) 
there are some ambiguous codons which could code for an amino acid or a 
stop codon:

e.g.
"NAG" could be E, K, Q or a stop codon
"YAG" could be either Q or a stop codon (as Y = C or T)

For example, if I try to translate the ambiguous sequence 
"CAAGGCGTCGAAYAGCTTCAGGAACAGGAC", I get an exception, 
"TranslationError: YAG":

from Bio.Seq import Seq
from Bio import Translate

# Ambiguous-DNA translator for NCBI table 11 (bacterial)
my_translator = Translate.ambiguous_dna_by_id[11]
my_dna = Seq('CAAGGCGTCGAAYAGCTTCAGGAACAGGAC',
             my_translator.table.nucleotide_alphabet)

#print my_translator.translate_to_stop(my_dna)
# The next line raises TranslationError: YAG
print my_translator.translate(my_dna)

The possible translations are 'QGVEQLQEQD' and 'QGVE*LQEQD'.
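
To illustrate where these possibilities come from, here is a minimal pure-Python sketch that does not use Biopython at all. It expands each ambiguous base into the concrete bases it stands for and looks every resulting codon up in the standard genetic code (table 11 assigns the same amino acids to the same codons). The names possible_translations and all_translations are hypothetical helpers invented for this illustration, not part of any library.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    (a + b + c, AMINO_ACIDS[16 * i + 4 * j + k])
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
)

# IUPAC ambiguity codes for DNA: each letter maps to the bases it can mean.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_translations(codon):
    """Return the set of symbols a (possibly ambiguous) codon could encode."""
    choices = [IUPAC_DNA[base] for base in codon.upper()]
    return set(CODON_TABLE["".join(c)] for c in product(*choices))


def all_translations(dna):
    """Return every distinct translation of an ambiguous DNA string, sorted.

    Trailing bases that do not fill a complete codon are ignored.
    """
    per_codon = [
        sorted(possible_translations(dna[i:i + 3]))
        for i in range(0, len(dna) - len(dna) % 3, 3)
    ]
    return sorted("".join(p) for p in product(*per_codon))
```

With this sketch, possible_translations("YAG") gives the set containing "Q" and "*", and all_translations("CAAGGCGTCGAAYAGCTTCAGGAACAGGAC") gives the two translations listed above. The number of results grows multiplicatively with each ambiguous codon, so full enumeration is only practical for sequences with very few of them.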

Is this situation something many other Biopython users have had to deal 
with?  I could write my own translate method for this particular 
application, but was wondering how best to support this within the basic 
Biopython setup.

Suggestion One - Fairly Simple
==============================
The translate_to_stop method could be enhanced with an option to control 
how it copes with ambiguous codons that could be either a stop or an 
amino acid:
(i) Treat as a stop codon "*", and stop translating there
(ii) Treat as amino acid, and continue translating
(iii) Treat as ambiguous (see suggestion two) and continue translating

As this is an unusual case, the additional code would only be triggered 
rarely so should not have much impact on the speed of the typical 
translation.

The same option could be added to the translate method, giving:
(i) Treat as stop codon, e.g. 'QGVE*LQEQD'
(ii) Treat as amino acid, e.g. 'QGVEXLQEQD' or better 'QGVEQLQEQD'
(iii) Treat as ambiguous (see suggestion two)

In this case (codon = "YAG"), if we assume it is an amino acid (and not a 
stop codon) it must be "Q".  In other cases (e.g. "NAG") the amino acid 
could be E, K or Q, so the translation would be "X".
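
The three options could be sketched roughly as follows. This is a standalone pure-Python illustration under stated assumptions, not a patch against Bio.Translate: the function name translate_ambiguous, its ambiguous_stop and to_stop parameters, and the use of "?" for "stop or amino acid" are all hypothetical. It also collapses every ambiguity between several amino acids to "X", which is a simplification.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    (a + b + c, AMINO_ACIDS[16 * i + 4 * j + k])
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
)

# IUPAC ambiguity codes for DNA: each letter maps to the bases it can mean.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_ambiguous(dna, ambiguous_stop="ambiguous", to_stop=False):
    """Translate DNA, choosing how to treat codons that may be a stop.

    ambiguous_stop (hypothetical option):
      "stop"      - option (i):   emit "*"
      "amino"     - option (ii):  emit the amino acid, or "X" if several fit
      "ambiguous" - option (iii): emit "?" (hypothetical symbol)
    to_stop: if true, end the translation at the first "*" emitted.
    """
    if ambiguous_stop not in ("stop", "amino", "ambiguous"):
        raise ValueError("unknown ambiguous_stop option: %r" % (ambiguous_stop,))
    protein = []
    dna = dna.upper()
    for i in range(0, len(dna) - len(dna) % 3, 3):
        choices = [IUPAC_DNA[base] for base in dna[i:i + 3]]
        options = set(CODON_TABLE["".join(c)] for c in product(*choices))
        amino = sorted(options - {"*"})
        # A single candidate amino acid is reported as itself, several as "X".
        amino_symbol = amino[0] if len(amino) == 1 else "X"
        if not amino:
            symbol = "*"            # unambiguously a stop codon
        elif "*" not in options:
            symbol = amino_symbol   # unambiguously not a stop codon
        elif ambiguous_stop == "stop":
            symbol = "*"
        elif ambiguous_stop == "amino":
            symbol = amino_symbol
        else:
            symbol = "?"
        if to_stop and symbol == "*":
            break
        protein.append(symbol)
    return "".join(protein)
```

For the example sequence, the "stop", "amino" and "ambiguous" settings give 'QGVE*LQEQD', 'QGVEQLQEQD' and 'QGVE?LQEQD' respectively, and "stop" combined with to_stop=True gives 'QGVE'. The extra branching only runs for codons whose expansion includes a stop, so unambiguous codons take the same path as before.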

Suggestion Two - Complex
========================
Biopython uses "*" for a stop codon, and "X" for any amino acid.  There 
does not seem to be a symbol meaning "either a stop codon or an amino 
acid", e.g. "?".  As far as I can tell, there is no IUPAC standard for 
this...

If this existed (maybe in a variant of the IUPACAmbiguousDNA alphabet) 
then we could expect to get back 'QGVE?LQEQD' from translate.

Old, but fairly relevant, email from Andrew Dalke

http://biopython.org/pipermail/biopython-dev/2000-August/000072.html

Peter