[Biopython-dev] unambiguous DNA

Thu Aug 10 12:07:41 EDT 2000

thomas at cbs.dtu.dk:
>I'd like to get 'X' instead of a '*' (stop signal) when there is no
>clear translation ...(when extracting all possible ORFs from raw - often
>pure - sequence data during e.g. complete genome sequencing projects)

Is this a common enough need for the standard code to support it?
If so, I can think of a couple of different ways.

1) As I described in my reply, there could be a new alphabet
encoding containing the 'X' character as an ambiguous amino acid.

If so, should stop codons still be translated to "*"?  That is,
should
  NATGATTANAATNTATTCCATTATATTGTTTAR
be translated to
  XDXNXFHYIV*  (with a stop encoded alphabet using "*")
or
  XDXNXFHYIVX  (with just the straight "X" extended alphabet)
?
Or should there be two different classes of translator objects
available, one for each request?  (I would rather not, and instead
use a converter object to strip out the StopEncoded part.)
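
To make the two candidate outputs concrete, here is a minimal
standalone sketch in present-day Python.  It is not the Biopython
translator: the names translate_ambiguous and strip_stop_encoding are
hypothetical, and it assumes the standard genetic code and that a codon
maps to 'X' only when its possible expansions disagree (so TAR, which
can only be a stop, still gives "*").

```python
from itertools import product

# Standard genetic code, laid out in the usual TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes and the bases each one can stand for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def translate_ambiguous(dna, unknown="X"):
    """Translate DNA, emitting `unknown` where the codon is not clear."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        # Collect the translation of every concrete codon this one allows.
        outcomes = {CODON_TABLE[a + b + c]
                    for a, b, c in product(*(IUPAC[x] for x in codon))}
        protein.append(outcomes.pop() if len(outcomes) == 1 else unknown)
    return "".join(protein)


def strip_stop_encoding(protein, stop="*", unknown="X"):
    """Stand-in for the converter idea: fold stops into the unknown symbol."""
    return protein.replace(stop, unknown)


seq = "NATGATTANAATNTATTCCATTATATTGTTTAR"
print(translate_ambiguous(seq))                       # XDXNXFHYIV*
print(strip_stop_encoding(translate_ambiguous(seq)))  # XDXNXFHYIVX
```

With one translator plus a converter, both outputs come from the same
code path, which is the reason for preferring it over two translator
classes.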

2) The translator object could acquire a third forward translation
method (in addition to "translate" and "translate_to_stop") perhaps
named "translate_ignoring_stop".  The code would be something like:

    def translate_ignoring_stop(self, seq, ignore_symbol = "X"):
        assert seq.alphabet == self.table.nucleotide_alphabet, \
               "cannot translate from the given alphabet (%s)" % \
               seq.alphabet
        s = seq.data
        letters = []
        append = letters.append
        table = self.table
        get = table.forward_table.get
        n = len(seq)
        for i in range(0, n-n%3, 3):
            try:                                      # Change
                append(get(s[i:i+3], ignore_symbol))  # Change
            except TranslationError:                  # Change
                append(ignore_symbol)                 # Change
        # return with the correct alphabet encoding (cache the encoding)
        try:
            alphabet = self._ignore_encoded[ignore_symbol]  # Change
        except KeyError:
            # UnknownEncoded doesn't currently exist, but easy to make
            alphabet = Alphabet.UnknownEncoded(table.protein_alphabet)  # Change
            self._ignore_encoded[ignore_symbol] = alphabet  # Change

        return Seq.Seq(string.join(letters, ""), alphabet)

Of course, the back_translate method would need to be told how to deal
with UnknownEncoded, which is hard with the current code.  'X' isn't
part of the protein alphabet so it can't be passed to the codon table's
reverse lookup, which expects one of the alphabet letters or 'None'
for a stop codon.

What could be done is to get the protein_alphabet from the codon table,
sort it, and append 'None' to the list.  (The sort is to guarantee
a consistent order no matter the codon table implementation in the
future.)  Then when 'X' is found, choose successive letters from the
sorted list, looping as needed.  This would get you a better looking
result, although the statistics will be wrong.  What I don't like
about it is that normal back translation lets the codon table return
a statistically appropriate result, while what I outlined above doesn't.
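
The cycling scheme just described could look roughly like this in
present-day Python.  This is only a sketch: BACK_TABLE is a made-up
four-entry table for illustration, and back_translate_cycling is a
hypothetical name, not an existing method.

```python
from itertools import cycle

# Hypothetical, tiny back table: amino acid (None = stop) -> one codon.
BACK_TABLE = {"M": "ATG", "F": "TTT", "K": "AAA", None: "TAA"}


def back_translate_cycling(protein, back_table, unknown="X"):
    """Back translate, replacing each unknown with the next stand-in letter."""
    # Sort the real letters for a stable order, then append None (stop).
    letters = sorted(k for k in back_table if k is not None) + [None]
    # A fresh cycle per call means the same input always gives the same output.
    stand_ins = cycle(letters)
    codons = []
    for aa in protein:
        if aa == unknown:
            aa = next(stand_ins)
        codons.append(back_table[aa])
    return "".join(codons)


print(back_translate_cycling("MXXXXX", BACK_TABLE))  # ATGTTTAAAATGTAATTT
```

Every letter gets used equally often here, which is exactly why the
statistics come out wrong.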

I like (2) because it's easy to understand, but it does have that
statistical problem, so I would go with (1) even though it may lead
to a proliferation of slightly different alphabets.  On the third
hand, the codon table could be changed to have some way to return
the statistically appropriate result, like a new method.  (It could
use the method I outlined above, except there would need to be some
way to reset the loop through the alphabet so successive calls to
back_translate the same sequence could always give the same results.)

Now that I think about it, I like that third option the best.  To
repeat: codon tables will have a new method which returns a generator
for randomly picked back translations.  This generator implements a
method (codon() ?) which returns a (possibly statistically appropriate)
nucleotide codon.  The back translate code would look like:

    def _back_translate_ignore(self, seq):
        s = seq.data
        letter = seq.alphabet.unknown_symbol
        letters = []
        append = letters.append
        table = self.table.back_table
        back_gen = self.table.back_generator()
        for c in s:
            if c == letter:
                append(back_gen.codon())
            else:
                append(table[c])
        return Seq.Seq(string.join(letters, ""),
                       self.table.nucleotide_alphabet)
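
One possible shape for the generator object, as a standalone sketch in
present-day Python.  BackGenerator, its codon_weights argument, and the
fixed seed are all assumptions made for illustration; nothing like this
exists in the codon table code yet.  Seeding a private random number
generator is one way to get the "same results on successive calls"
property without an explicit reset.

```python
import random


class BackGenerator:
    """Hypothetical source of codons for unknown residues."""

    def __init__(self, codon_weights, seed=0):
        # codon_weights maps codon -> relative frequency (assumed input).
        self._codons = list(codon_weights)
        self._weights = [codon_weights[c] for c in self._codons]
        # A private, seeded RNG keeps the picks reproducible.
        self._rng = random.Random(seed)

    def codon(self):
        """Return one codon, drawn in proportion to its weight."""
        return self._rng.choices(self._codons, weights=self._weights, k=1)[0]


def back_translate_ignore(protein, back_table, codon_weights, unknown="X"):
    """Back translate, asking the generator for a codon at each unknown."""
    back_gen = BackGenerator(codon_weights)  # fresh generator for every call
    return "".join(back_gen.codon() if aa == unknown else back_table[aa]
                   for aa in protein)


weights = {"ATG": 1.0, "TTT": 2.0, "AAA": 3.0}  # made-up frequencies
table = {"M": "ATG", "F": "TTT"}                # made-up back table
first = back_translate_ignore("MXFX", table, weights)
second = back_translate_ignore("MXFX", table, weights)
print(first == second)  # True: same seed, same picks
```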

                    Andrew