[Biopython] back-translation method for Seq object?

Wed Mar 9 15:33:28 UTC 2011

This is a reply to an old thread (October 2008), but I thought someone 
might find it useful.

In that thread, while discussing how to represent back-translations with 
ambiguous bases to avoid the combinatorial explosion of enumerating every 
possible back-translation, Bruce Southey gave a table similar to the one 
below. However, some of its ambiguous codons were incorrect, and others 
were too ambiguous and covered more than one amino acid. The codons for 
stop (*) were also missing. Some of these were corrected later in the 
thread, but not all.

Here are the correct ambiguous codons for the standard genetic code, 
using the IUPAC codes R = A/G, Y = C/T, H = A/C/T and N = any base:

* = TAG, TAA, TGA                = TAR, TGA
A = GCT, GCC, GCA, GCG           = GCN
C = TGT, TGC                     = TGY
D = GAT, GAC                     = GAY
E = GAA, GAG                     = GAR
F = TTT, TTC                     = TTY
G = GGT, GGC, GGA, GGG           = GGN
H = CAT, CAC                     = CAY
I = ATT, ATC, ATA                = ATH
K = AAA, AAG                     = AAR
L = TTA, TTG, CTT, CTC, CTA, CTG = TTR, CTN
M = ATG                          = ATG
N = AAT, AAC                     = AAY
P = CCT, CCC, CCA, CCG           = CCN
Q = CAA, CAG                     = CAR
R = CGT, CGC, CGA, CGG, AGA, AGG = CGN, AGR
S = TCT, TCC, TCA, TCG, AGT, AGC = TCN, AGY
T = ACT, ACC, ACA, ACG           = ACN
V = GTT, GTC, GTA, GTG           = GTN
W = TGG                          = TGG
Y = TAT, TAC                     = TAY
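
As a sanity check, the table can be verified mechanically. The following 
is only a minimal sketch in plain Python (no Biopython needed); the names 
IUPAC, CODON_TO_AA, AMBIGUOUS, expand and check_table are my own, chosen 
for illustration. It expands each ambiguous codon and confirms that, for 
every amino acid, the expansions cover exactly its codons with no overlap.

```python
from itertools import product

# IUPAC nucleotide codes; only the ones used in the table are listed.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "H": "ACT", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# The ambiguous codons from the table above.
AMBIGUOUS = {
    "*": ["TAR", "TGA"], "A": ["GCN"], "C": ["TGY"], "D": ["GAY"],
    "E": ["GAR"], "F": ["TTY"], "G": ["GGN"], "H": ["CAY"],
    "I": ["ATH"], "K": ["AAR"], "L": ["TTR", "CTN"], "M": ["ATG"],
    "N": ["AAY"], "P": ["CCN"], "Q": ["CAR"], "R": ["CGN", "AGR"],
    "S": ["TCN", "AGY"], "T": ["ACN"], "V": ["GTN"], "W": ["TGG"],
    "Y": ["TAY"],
}


def expand(ambiguous_codon):
    """Return the set of unambiguous codons matched by an ambiguous codon."""
    choices = (IUPAC[base] for base in ambiguous_codon)
    return {"".join(codon) for codon in product(*choices)}


def check_table():
    """Return the amino acids whose entries are wrong or overlapping."""
    bad = []
    for aa, ambiguous_codons in AMBIGUOUS.items():
        expected = {c for c, a in CODON_TO_AA.items() if a == aa}
        expansions = [expand(a) for a in ambiguous_codons]
        covered = set().union(*expansions)
        overlapping = sum(len(e) for e in expansions) != len(covered)
        if covered != expected or overlapping:
            bad.append(aa)
    return bad


if __name__ == "__main__":
    print(check_table())
```

An empty list from check_table() means every row of the table is exact.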

This is still not a one-to-one mapping in 4 of the 21 cases (L, R, S 
and *, which each need two ambiguous codons), but the combinatorial 
explosion is greatly reduced. For example, the protein 
ACDEFGHIKLMNPQRSTVWY* has 1,019,215,872 unambiguous back-translations. 
Using the table above it has 16, or in general 2^(L+R+S+*), where L, R, 
S and * are the numbers of those residues in the sequence.
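
The two counts are just products over the residues of the sequence. A 
short Python sketch of the arithmetic follows (Python 3.8+ for math.prod; 
the names N_CODONS, N_AMBIGUOUS and count_backtranslations are 
illustrative, not from Biopython):

```python
from math import prod

# Number of unambiguous codons per residue in the standard genetic code.
N_CODONS = {
    "*": 3, "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4,
    "H": 2, "I": 3, "K": 2, "L": 6, "M": 1, "N": 2, "P": 4,
    "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2,
}

# Number of ambiguous codons per residue: two for L, R, S and *, else one.
N_AMBIGUOUS = {aa: 2 if aa in "LRS*" else 1 for aa in N_CODONS}


def count_backtranslations(protein, counts):
    """Multiply the number of codon choices for each residue."""
    return prod(counts[aa] for aa in protein)


if __name__ == "__main__":
    protein = "ACDEFGHIKLMNPQRSTVWY*"
    print(count_backtranslations(protein, N_CODONS))     # 1019215872
    print(count_backtranslations(protein, N_AMBIGUOUS))  # 16
```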

If anyone has an algorithm for determining the set of non-overlapping 
ambiguous codons from any codon table, I would like to know. Thanks,

Jon

-- 
Jonathan Blakes
School of Computer Science
University of Nottingham