[Biopython-dev] [Bug 2547] New: Translation of ambiguous codons like NNN and TAN

bugzilla-daemon at portal.open-bio.org
Sun Jul 20 14:46:23 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2547

           Summary: Translation of ambiguous codons like NNN and TAN
           Product: Biopython
           Version: 1.47
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


It is often useful to translate ambiguous nucleotide sequences (e.g.
EST sequences), and these may contain codons which could code for an amino acid
OR a stop codon (e.g. NNN, TNN or TAN).

See for example Bug 2530 comment 6 and comment 9.

Currently Bio.Seq.translate() will not translate such sequences and raises an
exception.

The following example shows correct translation of ambiguous codons which only
encode valid amino acid(s) OR valid stop codons (but not both):

from Bio.Seq import translate
assert translate("TAA") == "*"
assert translate("TAG") == "*"
assert translate("TAT") == "Y"
assert translate("TAC") == "Y"
#Recall ambiguous nucleotide Y means T or C (pYrimidine)
#so TAY = TAT or TAC which both code for Y (Tyr, Tyrosine)
assert translate("TAY") == "Y"
#Recall ambiguous nucleotide R means G or A (puRine)
#so TAR = TAG or TAA which both code for a stop codon
assert translate("TAR") == "*"

However, in Biopython 1.47 the following all raise an exception:

translate("TAN")
translate("TAM")
translate("TAK")
translate("TRR")
translate("TNN")
translate("NNN")

TAN, TAM, TAK, ... can code for Y or a stop codon.  "TRR" can code for W or a
stop codon, "TNN" can code for several amino acids or a stop codon, and "NNN"
can code for any amino acid or a stop codon.
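
To make the ambiguity concrete, here is a minimal sketch (not Biopython code)
that expands an ambiguous codon into every unambiguous codon it can stand for
and collects the possible translations under the standard genetic code.  The
names IUPAC, STANDARD_TABLE and possible_translations are hypothetical and
exist only for this illustration.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in TCAG order ("*" is a stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def possible_translations(codon):
    """Return the set of one-letter results over all expansions of codon."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    return {STANDARD_TABLE["".join(bases)] for bases in expansions}
```

With this helper, possible_translations("TAY") gives only Y and
possible_translations("TAR") gives only "*", while "TAN", "TAM" and "TAK" each
give both Y and "*".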

According to IUPAC, the single letter protein code X is an "unknown or 'other'
amino acid" (ignoring its historic and obsolete usage for selenocysteine, now
U).
http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

This document does NOT cover the idea of stop codons, and I am not aware of any
additional symbol to mean "any amino acid OR a stop codon" which would be ideal
for this situation.

For comparison, the EMBOSS transeq tool will use X when given a codon which
could be either an amino acid OR a stop codon:

$ transeq -filter asis:NNNTANTARTAGTAYTAC
XX**YY

Therefore one solution would be to follow EMBOSS and return X for codons which
could be an amino acid OR a stop codon.
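
As a rough sketch of that policy (again not Biopython code, and not the EMBOSS
implementation), the function below returns the unique result when every
expansion of a codon agrees, and X otherwise.  Using X for codons that span
several amino acids but no stop is an assumption made here for simplicity; the
names IUPAC, STANDARD_TABLE and translate_ambiguous are hypothetical.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in TCAG order ("*" is a stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_ambiguous(sequence):
    """Translate codon by codon, using X when the outcome is not unique."""
    sequence = sequence.upper()
    if len(sequence) % 3:
        raise ValueError("sequence length is not a multiple of three")
    protein = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        # Every unambiguous codon this (possibly ambiguous) codon could be.
        expansions = product(*(IUPAC[base] for base in codon))
        outcomes = {STANDARD_TABLE["".join(bases)] for bases in expansions}
        # A single outcome is safe to report; anything else becomes X.
        protein.append(outcomes.pop() if len(outcomes) == 1 else "X")
    return "".join(protein)
```

On the transeq input above, translate_ambiguous("NNNTANTARTAGTAYTAC") yields
"XX**YY", matching the EMBOSS output.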

See also Bug 2530 on the related issue that Bio.Seq.translate() currently
translates invalid codons as "*" (presumably an accidental side effect of the
implementation).

