[BioPython] Translating ambiguous stop codons
Peter (BioPython List)
biopython at maubp.freeserve.co.uk
Fri Mar 10 14:07:56 UTC 2006
I've been working on simple gene finding within sequence contigs from
unfinished genomes. Very simply, I have used Biopython to scan each of
the six frames looking for a start codon, translating until the next
stop codon, and repeating. This is a pretty simple way of generating a
list of possible open reading frames for further analysis.
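
The scan described above can be sketched roughly as follows. This is
plain Python rather than the Biopython code I actually used; the names
find_orfs, start_codons and min_codons are made up for illustration,
only ATG is treated as a start codon by default, and reading frames that
run off the end of the contig without a stop codon are dropped.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G). NCBI table 11
# assigns the same amino acids; it differs only in permitted start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Complement map including the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, start_codons=("ATG",), min_codons=1):
    """Return (strand, frame, protein) for every start-to-stop stretch."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = None  # None means we are not inside an ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if protein is None:
                    if codon in start_codons:
                        protein = ["M"]  # initiator codon read as Met
                    continue
                amino = CODON_TABLE.get(codon, "X")  # ambiguous codon -> X
                if amino == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, "".join(protein)))
                    protein = None
                else:
                    protein.append(amino)
    return orfs


print(find_orfs("ATGAAATAG"))  # [('+', 0, 'MK')]
```

Note that this sketch silently turns any ambiguous codon into "X", so a
codon that might be a stop never ends a reading frame.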
Unfortunately (as is probably the case for many unfinished genomes)
there are some ambiguous codons which could code for an amino acid or a
stop codon:
e.g.
"NAG" could be E, K, Q or a stop codon
"YAG" could be either Q or a stop codon (as Y = C or T)
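
To check cases like these, one can expand each ambiguous codon into the
concrete codons it stands for and collect their translations. A minimal
sketch in plain Python, assuming the standard amino acid assignments
(which NCBI table 11 shares); possible_translations is a hypothetical
helper, not a Biopython function.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_translations(codon):
    """Return the set of symbols an (ambiguous) codon could translate to."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    return {CODON_TABLE["".join(bases)] for bases in expansions}


print(sorted(possible_translations("NAG")))  # ['*', 'E', 'K', 'Q']
print(sorted(possible_translations("YAG")))  # ['*', 'Q']
```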
For example, if I try to translate the ambiguous sequence
"CAAGGCGTCGAAYAGCTTCAGGAACAGGAC" I get an exception,
"TranslationError: YAG":
from Bio.Seq import Seq
from Bio import Translate

# Ambiguous DNA translator for NCBI table 11 (bacterial)
my_translator = Translate.ambiguous_dna_by_id[11]
my_dna = Seq('CAAGGCGTCGAAYAGCTTCAGGAACAGGAC',
             my_translator.table.nucleotide_alphabet)
#print my_translator.translate_to_stop(my_dna)
# Raises TranslationError: YAG
print my_translator.translate(my_dna)
The possible translations are 'QGVEQLQEQD' and 'QGVE*LQEQD'.
Is this situation something many other BioPython users have had to deal
with? I could write my own translate method for this particular
application, but was wondering how best to support this within the basic
BioPython setup.
Suggestion One - Fairly Simple
==============================
The translate_to_stop method could be enhanced with an option to control
how it copes with ambiguous codons that could be either a stop or an
amino acid:
(i) Treat as a stop codon "*", and stop translating there
(ii) Treat as amino acid, and continue translating
(iii) Treat as ambiguous (see suggestion two) and continue translating
As this is an unusual case, the additional code would only be triggered
rarely, so it should not have much impact on the speed of a typical
translation.
The same option could also be added to the translate method, giving:
(i) Treat as stop codon, e.g. 'QGVE*LQEQD'
(ii) Treat as amino acid, e.g. 'QGVEXLQEQD' or better 'QGVEQLQEQD'
(iii) Treat as ambiguous (see suggestion two)
In this case (codon = "YAG") if we assume it is an amino acid (and not a
stop codon) it must be "Q". For other codons (e.g. "NAG") the amino acid
could be E, K or Q, so the translation would be "X".
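
Options (i) and (ii) could look something like the sketch below. This is
plain Python, not a patch against Bio.Translate; the function name and
the ambiguous_stop and to_stop parameters are invented for illustration,
and any codon with more than one possible amino acid simply becomes "X"
(the B and Z ambiguity symbols are ignored for brevity).

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); table 11
# assigns the same amino acids.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate(dna, ambiguous_stop="stop", to_stop=False):
    """Translate dna; ambiguous_stop is 'stop' (i) or 'amino' (ii)."""
    if ambiguous_stop not in ("stop", "amino"):
        raise ValueError("ambiguous_stop must be 'stop' or 'amino'")
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        expansions = product(*(IUPAC[base] for base in dna[i:i + 3]))
        options = {CODON_TABLE["".join(bases)] for bases in expansions}
        if "*" in options and len(options) > 1:
            # Codon could be either a stop or an amino acid.
            if ambiguous_stop == "stop":
                options = {"*"}
            else:
                options = options - {"*"}
        amino = next(iter(options)) if len(options) == 1 else "X"
        if amino == "*" and to_stop:
            break
        protein.append(amino)
    return "".join(protein)


seq = "CAAGGCGTCGAAYAGCTTCAGGAACAGGAC"
print(translate(seq, "stop"))                # QGVE*LQEQD
print(translate(seq, "amino"))               # QGVEQLQEQD
print(translate(seq, "stop", to_stop=True))  # QGVE
```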
Suggestion Two - Complex
========================
Biopython uses "*" for a stop codon, and "X" for any amino acid. There
does not seem to be a symbol meaning "either a stop codon or an amino
acid"; "?" would be one candidate. As far as I can tell, there is no
IUPAC standard for this...
If such a symbol existed (maybe in a variant of the IUPAC protein
alphabet) then we could expect to get back 'QGVE?LQEQD' from translate.
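
As a rough illustration of that idea, the sketch below maps each codon to
"*", a single amino acid, "X", or the proposed "?" symbol. It is plain
Python with made-up names (codon_symbol, translate_ambiguous), and "?"
is only the placeholder suggested above, not an existing IUPAC or
Biopython symbol.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); table 11
# assigns the same amino acids.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codon_symbol(codon):
    """Return '*', one amino acid, 'X' (any amino acid) or '?' (stop or amino acid)."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    options = {CODON_TABLE["".join(bases)] for bases in expansions}
    if len(options) == 1:
        return next(iter(options))  # unambiguous, including a definite stop
    if "*" in options:
        return "?"  # hypothetical symbol: stop codon or amino acid
    return "X"


def translate_ambiguous(dna):
    return "".join(codon_symbol(dna[i:i + 3]) for i in range(0, len(dna) - 2, 3))


print(translate_ambiguous("CAAGGCGTCGAAYAGCTTCAGGAACAGGAC"))  # QGVE?LQEQD
```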
Old, but fairly relevant, email from Andrew Dalke
http://biopython.org/pipermail/biopython-dev/2000-August/000072.html
Peter