[Biopython-dev] Bio.Seq: implementing of translation of gapped sequences

Tue Nov 3 14:54:29 UTC 2015

Hi all,

The pull request from Carlos for gap codon translation is here:

https://github.com/biopython/biopython/pull/661

The pull request adds a gap argument to the translate method
(should this be gap_char to match the alphabet object?), which
by default takes the gap character from the sequence's alphabet.

I've written up some examples here:

https://github.com/biopython/biopython/pull/661#issuecomment-153376803

There is one potentially surprising change to existing behaviour.
Previously the gap codon would always raise a translation error:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, Gapped
>>> Seq("ACT---TAA", Gapped(generic_dna)).translate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Seq.py", line 1004, in translate
    stop_symbol, to_stop, cds)
  File "Bio/Seq.py", line 2064, in _translate_str
    "Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon '---' is invalid

With the pull request applied, if the gap character is explicit via
the alphabet, the same call returns a gapped protein instead:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, Gapped
>>> Seq("ACT---TAA", Gapped(generic_dna)).translate()
Seq('T-*', HasStopCodon(Gapped(ExtendedIUPACProtein(), '-'), '*'))

Or, using a different gap character:

>>> Seq("ACT~~~TAA", Gapped(generic_dna, "~")).translate()
Seq('T~*', HasStopCodon(Gapped(ExtendedIUPACProtein(), '~'), '*'))
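
For anyone who wants to try the idea without the branch checked out,
here is a rough stand-alone sketch of the behaviour in plain Python. It
is not the code from the pull request: the names translate_gapped and
CODON_TABLE are made up for illustration, it only knows the standard
genetic code, and it raises ValueError rather than TranslationError. It
also assumes that a codon which is only partly gap (e.g. "A-T") is still
an error.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_gapped(dna, gap=None):
    """Translate a DNA string, mapping an all-gap codon to one gap character.

    Hypothetical helper, not part of Biopython. With gap=None a gap codon
    is rejected like any other invalid codon.
    """
    if len(dna) % 3:
        raise ValueError("sequence length is not a multiple of three")
    protein = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3].upper()
        if gap is not None and codon == gap * 3:
            # A complete gap codon becomes a single gap in the protein.
            protein.append(gap)
        elif codon in CODON_TABLE:
            protein.append(CODON_TABLE[codon])
        else:
            # Covers unknown letters and partial gap codons such as "A-T".
            raise ValueError("Codon '{0}' is invalid".format(codon))
    return "".join(protein)


print(translate_gapped("ACT---TAA", gap="-"))
print(translate_gapped("ACT~~~TAA", gap="~"))
```

Calling translate_gapped("ACT---TAA") with no gap argument raises the
error, which mirrors the old behaviour shown earlier.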

I think this is a small change, and worthwhile overall for the
useful functionality. It is something I was planning to add myself
at some point but had never gotten round to.

Thoughts? Feedback? You can use GitHub if you prefer:

https://github.com/biopython/biopython/pull/661

Regards,

Peter

On Tue, Nov 3, 2015 at 8:26 AM, Carlos Peña <mycalesis at gmail.com> wrote:
> Hi all,
>
>
> I have prepared a pull request that implements the translation of gapped
> sequences (thanks Peter for the guidelines!).
>
> The code will infer the gap character from the Seq object's alphabet. If
> the alphabet does not define one, translate can optionally accept a gap
> character. In either case it will return a protein sequence with gaps;
> otherwise it will raise a TranslationError.
>
> At least for me, the change will simplify the code of my projects.
> However, this implementation might bite you if you are expecting a
> TranslationError from trying to translate a gapped sequence. Instead you
> will get back a gapped protein sequence (if your gap consists of dashes "-").
>
> Is this change desirable for the Biopython project? I noticed that the
> scikit-bio project does not implement gapped translations:
> https://github.com/biocore/scikit-bio but I don't know why.
>
>
> cheers
>
>
> carlos
>
>
> Dr. Carlos Peña
> Laboratory of Genetics
> Department of Biology
> University of Turku
> 20014 Turku
> FINLAND
>
> * Associate Editor: Revista peruana de Biología
> http://is.gd/TwbW
>
> * The Nymphalidae Systematics Group
> http://nymphalidae.utu.fi/db.php