[Bioperl-l] Re: translation using Bioperl

Andrew Dalke dalke@acm.org
Tue, 25 Jul 2000 15:40:33 -0600


From: Will Fischer <wfischer@indiana.edu>
>Correct behavior (IMHO) would be to check whether the first codon
>matches a valid start (in the genetic code being used): if yes, 
>put in Met; otherwise, put in the default amino-acid and
>(perhaps) complain.  

Short answer: I agree, except that it's impossible for bioperl, as
a library, to do this.  It is the responsibility of users of the library
to decide what to do when the first codon isn't a start codon.
The difficulty is knowing the "perhaps" part.  Details below.

When I implemented the translation code for biopython (...

  >Since (shame on me) I haven't played with the relevant modules yet,
  >I can't give you a modest or immodest opinion about how it should be
  >coded. 

  and I did look at the bioperl and biojava groups for reference, but
  have since forgotten what I learned - see the back biopython mailings
  for my opinions :)

...) I was concerned that I didn't know the source of the sequence.
It could be a full biological sequence or a subsequence of the same.

Algebraically, I wanted
     translation(seq) == translation(seq[:x]) + translation(seq[x:])
(or  atom_count(seq) == atom_count(seq[:x]) + atom_count(seq[x:])
 or  mw(seq) == mw(seq[:x]) + mw(seq[x:])
) where x is some arbitrary cut point of the sequence.
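
To make the algebra concrete, here is a minimal Python sketch.  It is
not the biopython code; the codon table is a toy subset, and the names
translate and translate_start_aware are made up for illustration.  It
only cuts on codon boundaries (x a multiple of 3), which is where the
identity can be expected to hold for translation.

```python
# Toy codon table: only the codons used below, not a full genetic code.
CODON_TABLE = {"ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W", "TTG": "L"}
# Assumed set of start codons, for illustration only.
START_CODONS = {"ATG", "TTG"}


def translate(seq):
    """Translate codon by codon, treating every codon the same way.

    A trailing partial codon is ignored.
    """
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def translate_start_aware(seq):
    """Same, but emit 'M' whenever the first codon is a start codon."""
    protein = translate(seq)
    if seq[:3] in START_CODONS:
        return "M" + protein[1:]
    return protein


seq = "GCCTTGAAATGG"  # GCC TTG AAA TGG

# The plain translation is additive at every codon boundary.
for x in range(0, len(seq) + 1, 3):
    assert translate(seq) == translate(seq[:x]) + translate(seq[x:])

# The start-aware version is not: cutting at x = 3 makes TTG the first
# codon of the right-hand piece, so it is read as a start.
whole = translate_start_aware(seq)
pieces = translate_start_aware(seq[:3]) + translate_start_aware(seq[3:])
print(whole, pieces)
```

The two printed strings differ at the second residue, which is the
kind of surprise the identity is meant to rule out.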

You can only put on the 'M' if you know the sequence is the
start of the biological sequence.  In order to do that you need to start
tracking whether the ends are biological ends or not.  I experimented
with tracking that data since I also wanted to get the mw and atom
counts very accurate (+/- the weight of the O-H and the H).

It turned out to be extremely complicated.  For example, suppose you
transliterated a subsequence of DNA to RNA then translated the RNA
to protein.  The transliteration function has to be told if it mimics
biological cutting or not, that is, are the ends of the RNA real ends
or just arbitrary algorithm ends?  If the end of the DNA was real and
the conversion to RNA says the same end is chosen arbitrarily, is the
end still "real" or is it arbitrary now?

Even worse, suppose you've done a homology search and BLAST comes back
with a similar sequence.  Do you have to go back to the database to find
out if the subsequence's ends are real?  Suppose the database doesn't
keep those records itself (eg, it's an arbitrary partitioning of a
larger sequence).  So you end up with "this end is biological", "this
end is algorithmic" and "the end type is unknown."  Double that, since
there are two ends, and you see why it's hairy.
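
A rough Python sketch of that bookkeeping, to show how quickly it
grows.  This is not what I actually wrote; EndStatus and slice_ends are
hypothetical names, and the slicing rule (a cut always produces an
algorithmic end) is just one plausible assumption.

```python
from enum import Enum
from itertools import product


class EndStatus(Enum):
    BIOLOGICAL = "biological"    # a real end of the molecule
    ALGORITHMIC = "algorithmic"  # an arbitrary cut made by some code
    UNKNOWN = "unknown"          # the source doesn't say


def slice_ends(left, right, start, stop, length):
    """Return the end statuses of seq[start:stop] for a seq of given length.

    Assumed rule: an end keeps its status only if the slice reaches it;
    otherwise the new end is an algorithmic cut.
    """
    new_left = left if start == 0 else EndStatus.ALGORITHMIC
    new_right = right if stop == length else EndStatus.ALGORITHMIC
    return new_left, new_right


# Every sequence carries one status per end, so every operation has to
# decide what to do for each (left, right) combination.
combos = list(product(EndStatus, repeat=2))
print(len(combos))

print(slice_ends(EndStatus.BIOLOGICAL, EndStatus.UNKNOWN, 0, 5, 10))
```

Slicing is the easy case.  Each function that converts one sequence
type to another needs its own rule like slice_ends, and that is where
the guesses get hard to intuit.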

All of this was done so that algebra works as expected.  The solution
I came up with was to ignore the status of the ends, assume everything
is biologically non-terminal, and force the caller to know how to adjust
the results to get the biologically relevant answer.  So instead of
tracking all the details, I punted and forced someone else to do the
hard work.  That was the only thing I could come up with that didn't
make hard-to-intuit guesses along the way.

To bring this back to the topic: I agree that it's the right behaviour,
but not for the bioperl or biopython tools themselves.  If you consider
that code as a library, it's the responsibility of the person using the
library to make the call on what to do.  The library should merrily
translate away and the *calling* code should detect "hmm, this doesn't
start off with an M, I think I'll complain."
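
As a sketch of that split in Python: library_translate stands in for
whatever the library provides, and translate_cds is the caller's code.
Both names, the toy codon table and the set of start codons are
assumptions for illustration, not the bioperl or biopython API.

```python
import warnings

# Toy codon table: only the codons used below, not a full genetic code.
CODON_TABLE = {"ATG": "M", "GCC": "A", "AAA": "K", "TTG": "L"}
# The caller decides which codons count as starts (assumed set here).
START_CODONS = {"ATG", "TTG"}


def library_translate(seq):
    """Library side: translate every codon the same way, no guessing."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def translate_cds(seq):
    """Caller side: this code knows seq is meant to be a full coding
    sequence, so it applies the start-codon rule and complains if needed.
    """
    protein = library_translate(seq)
    if seq[:3] in START_CODONS:
        return "M" + protein[1:]
    warnings.warn("sequence does not begin with a start codon: %r" % seq[:3])
    return protein


print(library_translate("TTGGCCAAA"))  # library: plain translation
print(translate_cds("TTGGCCAAA"))      # caller: first residue forced to M
```

The library function stays additive, and the biological decision lives
in the one place that knows where the sequence came from.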

                    Andrew
                    dalke@acm.org