[Biopython-dev] [Bug 2783] New: Using alternative start codons in Bio.Seq translate method/function

Fri Mar 6 17:34:58 UTC 2009

http://bugzilla.open-bio.org/show_bug.cgi?id=2783

           Summary: Using alternative start codons in Bio.Seq translate
                    method/function
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

This bug covers an issue originally raised on Bug 2381.  This bug is
specifically about how to translate a CDS using a non-standard start codon (a
codon which doesn't normally encode methionine).

In computing, we often blindly translate without worrying about start codons.
For example, you might translate a whole genome (in all six frames) as part
of looking for open reading frames.  Translating a partial CDS where the start
is missing is another example.  The current Bio.Seq translation functionality
supports these usages.
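
To make the "blind" case concrete, here is a rough stand-alone sketch of a
six-frame translation.  It does not use Biopython; the helper names are made
up for illustration, and it assumes unambiguous upper- or lower-case DNA and
the standard amino acid assignments (which table 11 shares):

```python
# Illustrative sketch only; these helpers are hypothetical, not Bio.Seq code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate_blindly(dna):
    """Translate codon by codon, with no special handling of the first codon."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to blind translations."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate_blindly(sequence[offset:])
    return frames


for label, protein in sorted(six_frame_translations("ATGAAATAA").items()):
    print(label, protein)
```

Nothing here treats the first codon specially, which is exactly what you want
for this kind of scan.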

In real biology, however, translation from RNA to amino acids always starts at
an initiation/start codon (typically AUG) which becomes the methionine at the
start of the protein.  In eukaryotes, usually the only start codon is AUG, and
it normally encodes methionine, so this doesn't seem special.  However, in many
organisms there are lots of genes with alternative start/initiation codons
which do NOT normally encode methionine.  When they are used as a
start/initiation codon, though, they DO get translated as methionine!

For example, there are 418 annotated genes in E. coli K12 with non-standard
start codons - which you might want to translate into proteins (which *should*
start with a methionine).
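
As a rough illustration of how such a tally could be made, the sketch below
counts the first codon of each record in FASTA-formatted text.  It is plain
Python rather than Biopython, the function names are hypothetical, and it
assumes every record is a complete CDS in frame 1:

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (title, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], []])
        elif records:
            records[-1][1].append(line)
    return [(title, "".join(parts).upper()) for title, parts in records]


def start_codon_counts(text):
    """Count how often each codon appears as the first codon of a record."""
    return Counter(sequence[:3] for _, sequence in read_fasta(text))


# Two made-up records, purely to show the call.
example = ">cds_one\nGTGAAA\nTAA\n>cds_two\nATGAAATAA\n"
counts = start_codon_counts(example)
non_atg = sum(n for codon, n in counts.items() if codon != "ATG")
print(counts["ATG"], non_atg)
```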

Take, for instance, the following NCBI FASTA file of CDS sequences:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12_substr__MG1655

Here is the CDS for gene yaaX:

>ref|NC_000913.2|:5234-5530
GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA
GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT
AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT
TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT
AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA

This starts with GTG, which is a valid bacterial start codon.  I'd like to be
able to translate this and get the actual biologically relevant protein as
given in the GenBank file NC_000913.gbk (with or without the stop symbol at the
end), which starts with "M" not "V":

     CDS             5234..5530
                     /gene="yaaX"
                     /locus_tag="b0005"
                     /codon_start=1
                     /transl_table=11
                     /product="predicted protein"
                     /protein_id="NP_414546.1"
                     /db_xref="ASAP:ABE-0000015"
                     /db_xref="UniProtKB/Swiss-Prot:P75616"
                     /db_xref="GI:16127999"
                     /db_xref="ECOCYC:G6081"
                     /db_xref="EcoGene:EG14384"
                     /db_xref="GeneID:944747"
                     /translation="MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGY
                     YWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR"

Without any non-standard start codon support, my translations start with a V
(rather than the desired M):

>>> from Bio.Seq import Seq
>>> yaaX = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
...            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
...            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
...            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
...            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")
>>> print yaaX.translate(table=11)
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
>>> print yaaX.translate(table=11, to_stop=True)
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR

These start with "V", while in this situation I want an "M" because I know this
is a full CDS and the first codon is a start codon.

I therefore want to add an optional argument to the Seq object's translate
method (and the Bio.Seq.translate function) so that I can obtain the desired
results (both with and without the terminator stop symbol).  I want an option
to tell Biopython that this sequence commences with a start/initiation codon:

>>> print yaaX.translate(table=11, with_start_codon=True)
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
>>> print yaaX.translate(table=11, to_stop=True, with_start_codon=True)
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR

I have in the above example called this new argument "with_start_codon", but I
am open to naming suggestions.  If False (default), then nothing changes.  If
the new argument is True, this indicates that the first codon should be a valid
start/initiation codon (in the declared translation table), and that it should
be translated as a methionine.
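
To pin down the intended behaviour, here is a minimal stand-alone sketch of
these semantics for table 11.  It is not the patch and does not use Biopython:
translate_sketch is a hypothetical helper, it only handles unambiguous DNA (any
other codon raises a KeyError), and raising ValueError for an invalid first
codon is just one possible choice, since that case is not decided above.

```python
# Stand-alone sketch of the proposed semantics; not the Bio.Seq implementation.
BASES = "TCAG"
# Amino acids for NCBI translation table 11, codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
FORWARD_TABLE = dict(zip(CODONS, AMINO_ACIDS))
# Start codons listed for table 11.
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}


def translate_sketch(dna, to_stop=False, with_start_codon=False):
    """Translate unambiguous DNA; hypothetical helper for illustration."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    protein = []
    for position in range(0, usable, 3):
        codon = dna[position:position + 3]
        if position == 0 and with_start_codon:
            if codon not in START_CODONS:
                # Assumption: how to handle this case is not settled above.
                raise ValueError("%s is not a start codon in table 11" % codon)
            protein.append("M")  # any valid start codon becomes methionine
            continue
        amino_acid = FORWARD_TABLE[codon]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


# First four codons of yaaX followed by a stop codon.
fragment = "GTGAAAAAGATGTAA"
print(translate_sketch(fragment))                                       # VKKM*
print(translate_sketch(fragment, with_start_codon=True))                # MKKM*
print(translate_sketch(fragment, to_stop=True, with_start_codon=True))  # MKKM
```

Only the first codon is affected: the in-frame ATG further along is translated
as M in the usual way, and an internal GTG would still give V.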

I will upload a patch implementing this in a moment...

This proposal is NOT about an option to have the translate function/method
search the sequence for the first valid start codon (either in frame or not).

This proposal is NOT about an option to check the sequence is a valid CDS (i.e.
starts with a start codon, ends with an in-frame stop codon, and has no
internal premature stop codons), and then translating it.  While this makes
sense (and BioPerl does this), this would prevent certain uses, e.g.
translating a partial CDS sequence where the 3' end is missing.
