[Biopython-dev] Bio.Seq: implementing of translation of gapped sequences

Peter Cock p.j.a.cock at googlemail.com
Tue Nov 3 14:54:29 UTC 2015

Hi all,

The pull request from Carlos for gap codon translation is here:


The proposed behaviour adds a gap argument to the translate
method (should this be gap_char to match the alphabet object?),
and will look at the alphabet by default.

I've written up some examples here:


There is one potentially surprising change to existing behaviour.
Previously the gap codon would always raise a translation error:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, Gapped
>>> Seq("ACT---TAA", Gapped(generic_dna)).translate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Seq.py", line 1004, in translate
    stop_symbol, to_stop, cds)
  File "Bio/Seq.py", line 2064, in _translate_str
    "Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon '---' is invalid

If the gap character is explicit via the alphabet this becomes:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, Gapped
>>> Seq("ACT---TAA", Gapped(generic_dna)).translate()
Seq('T-*', HasStopCodon(Gapped(ExtendedIUPACProtein(), '-'), '*'))

Or using a different gap character,

>>> Seq("ACT~~~TAA", Gapped(generic_dna, "~")).translate()
Seq('T~*', HasStopCodon(Gapped(ExtendedIUPACProtein(), '~'), '*'))
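
For anyone who wants to see the idea without reading the pull
request, here is a minimal pure-Python sketch of gap-aware
translation. It is not the code from the pull request: the function
name translate_gapped is hypothetical, it works on plain strings
rather than Seq objects, it only knows the standard genetic code, and
it raises ValueError where Biopython raises TranslationError. A codon
made entirely of the gap character becomes a single gap in the
protein; anything else not in the table is still an error.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_gapped(seq, gap=None):
    """Translate a nucleotide string, mapping an all-gap codon to one gap.

    Hypothetical helper for illustration only. With gap=None, a gap
    codon is treated as invalid, which mirrors the old behaviour.
    """
    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of three")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3].upper()
        if gap is not None and codon == gap * 3:
            # Whole codon is gap characters: emit one gap residue.
            protein.append(gap)
        elif codon in CODON_TABLE:
            protein.append(CODON_TABLE[codon])
        else:
            # Covers partial gaps such as "A--" as well as unknown letters.
            raise ValueError("Codon '{0}' is invalid".format(codon))
    return "".join(protein)


print(translate_gapped("ACT---TAA", gap="-"))
print(translate_gapped("ACT~~~TAA", gap="~"))
```

Note that a codon which is only partly gapped (say "A--") is still
rejected in this sketch, since it is not clear what it should
translate to.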

I think this is a small change, and worthwhile overall for the
useful functionality. It is something I was planning to add myself
at some point but had never gotten round to.

Thoughts? Feedback? You can use GitHub if you prefer:




On Tue, Nov 3, 2015 at 8:26 AM, Carlos Peña <mycalesis at gmail.com> wrote:
> Hi all,
> I have prepared a pull request to try implementing the translation of gapped
> sequences (thanks Peter for guidelines!).
> The code will infer the gap character from the Seq object's given alphabet.
> If the alphabet is not present it can optionally accept a gap character,
> then it will return a protein sequence with gaps, otherwise it will raise a
> Translation error.
> At least for me, the change will allow me to simplify the code of my projects.
> However, this implementation might bite you if you are expecting a
> TranslationError from trying to translate a gapped sequence. Instead you
> will get back a gapped protein sequence (if your gap consists of dashes "-").
> Is this change desirable for the Biopython project? I noticed that the
> scikit-bio project does not implement gapped translations:
> https://github.com/biocore/scikit-bio but I don't know why.
> cheers
> carlos
> Dr. Carlos Peña
> Laboratory of Genetics
> Department of Biology
> University of Turku
> 20014 Turku
> * Associate Editor: Revista peruana de Biología
> http://is.gd/TwbW
> * The Nymphalidae Systematics Group
> http://nymphalidae.utu.fi/db.php
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython-dev
