[Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq)

Tue Nov 4 17:43:53 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2381

------- Comment #34 from bsouthey at gmail.com  2008-11-04 12:43 EST -------
(In reply to comment #33)
> (In reply to comment #32)
> > > In which of these examples do you understand that the first position is
> > > being forced to a Methionine?
> 
> With my suggested code, you would not just be forcing the first codon to be a
> methionine.  You would also be asking for the first codon to be validated as a
> start codon (initialisation codon).
> 
> > None are particularly clear, but only one of them doesn't give me the wrong
> > idea...
> 
> In some cases I seem to have guessed different possible meanings for some of
> these suggested names - so those are probably unclear.
> 
> > > >>> translate("TTGAAACCCTAG", init=True, to_stop=True)
> > 
> > Because I've read this thread (or looked at the docs) - I understand this one
> > ;)
> 
> To me this suggests something special is happening with the initialisation of
> the translation - but I agree it's not clear what without checking the
> documentation.
> 
> > > >>> translate("TTGAAACCCTAG", force_as_translating=True, to_stop=True)
> > 
> > I don't intuitively understand this.  Does it mean that the sequence should be
> > translatable?
> 
> Ditto - an argument called force_as_translating means nothing to me.  You're
> calling a translation method so what can forcing a translation mean?
> 
> > > >>> translate("TTGAAACCCTAG", force_methionine=True, to_stop=True)
> > 
> > Does this mean that the sequence will be translated from the first methionine
> > the method finds?
> 
> I would have guessed force_methionine would ignore the value of the first three
> nucleotides in order to treat them as a methionine (even if they are not a
> start codon).
> 
> > > >>> translate("TTGAAACCCTAG", force_methionine=True, force_stop=True)
> > 
> > As above, and does force_stop mean that you add a '*' to the end of the
> > translation?  Or that you stop at a stop codon?
> 
> Like Leighton, I would be confused by "force_stop".  It could mean add a stop
> symbol to the end of the amino acid sequence even if there isn't one there
> already.
> 
> > > >>> translate("TTGAAACCCTAG", alt_start=True, alt_stop=True)
> > 
> > 'alt_start' I would think referred to allowing translation from alternative
> > start codons.  I don't know what alt_stop would mean...
> 
> I think "alt_start" would be misleading for the intended dual functionality. 
> Consider the typical use case for this option - translating a CDS, which most
> of the time will use the typical start codon AUG / ATG (but not always).
> We'd want the start codon validated - and it often won't be an alternative
> start codon.  So calling the argument "alt_start" is confusing.
> 
> > > Also, I don't think this option will be used very often. 
> > 
> > Maybe not.  The first use case that comes to mind is QA on CDS-finding:
> > 
> > # Check if sequence is CDS:
> > assert candidate_cds.translate(init=True)
> > # Check if reported CDS start is valid
> > assert est[37:].translate(init=True)
> > 
> > A second use case is slower in presenting itself...
> 
> I think translating a CDS is quite a common task - so a very long argument
> would be bad.
> 
> Instead of the "init" start codon option in attachment 1032, I'd also be happy
> with a single boolean argument which does start codon validation, treats this
> as a methionine, checks the sequence is a multiple of three in length, checks
> for a final stop codon, and checks for no additional stop codons.  We'd ruled
> out calling this "complete", but maybe "cds" would be better?
> 
> > > So, it shouldn't be a problem if its name is too long to type, and it would
> > > be better if it is easy to understand.
> > 
> > That's a fair argument, I think.  On the whole, though, I would favour a
> > short, unambiguous, slightly cryptic name over a very long, unambiguous
> > name, over an ambiguous name of any length.
> 
> There is a lot of subjectiveness in argument naming - clearly we have not come
> up with a perfect suggestion yet.
> 
> Unfortunately "init" can be misunderstood (I'm not 100% sure what you were
> trying to say in comment 31, but I think you thought from the name "init" could
> be some sort of optional optimisation initialisation).
> 
> How about "cds_start" instead of "init"?
> 

Thinking about this and the various comments, I do think that you must apply
the same reasoning to non-standard translation as was applied to the ORF
finding comments. From those I understand that you want a basic translation
function, so function arguments like to_stop or cds_start would be
inappropriate. Also, even if it were possible, I do not see that validating all
known start codons under all genetic codes fits here.

Rather I think the various comments reflect various combinations of three major
steps:

1) Identify the region to be translated, as in NCBI's sequence viewer, where a
range from 'begin' to 'end' denotes the region to be viewed. Under this view,
start_from or begin_at could be either a position or the first occurrence of a
start codon. Likewise to_end or end_at could be a position or the first
occurrence of a stop codon. I note this also implies a frame, but I think that
has a separate meaning.

2) Having defined the region to be translated, translate that region as defined
by the frame and selected table. A question here is whether the frame should be
set to one once a region is defined.

3) Apply any non-standard codons to the translated sequence. If you are going
to allow non-standard start codons, you also need to handle selenocysteine
(http://en.wikipedia.org/wiki/Selenocysteine) and, to a lesser extent,
pyrrolysine (http://en.wikipedia.org/wiki/Pyrrolysine). Technically, you can
argue that the table used for translation in 2) should reflect this, but I
consider it a separate issue. Also, the handling of a stop codon would likewise
need to change.
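
As a rough illustration of how these three steps could be kept separate, here
is a minimal Python sketch. It does not use Biopython; the function names
(find_region, translate_region, apply_overrides) are made up for this example,
and it assumes upper-case DNA and the standard table only.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def find_region(seq, begin=0):
    """Step 1: return (begin, end), where end is just past the first
    in-frame stop codon, or past the last complete codon if there is none."""
    for i in range(begin, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return begin, i + 3
    return begin, begin + (len(seq) - begin) // 3 * 3


def translate_region(seq, begin, end):
    """Step 2: plain codon-by-codon translation of seq[begin:end].
    Raises KeyError for codons not in the table (e.g. ambiguous bases)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(begin, end, 3))


def apply_overrides(protein, overrides):
    """Step 3: replace residues at 1-based positions, e.g. {1: "M"}."""
    residues = list(protein)
    for position, residue in overrides.items():
        if not 1 <= position <= len(residues):
            raise ValueError("position %r is outside the protein" % position)
        residues[position - 1] = residue
    return "".join(residues)


seq = "TTGAAACCCTAGGGG"
begin, end = find_region(seq)
protein = translate_region(seq, begin, end)
print(apply_overrides(protein, {1: "M"}))  # MKP*
```

Keeping step 2 free of start and stop handling is the point of the split: the
translation itself stays basic, and anything special happens before or after.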

The non-standard codon usages are rare, and I really question whether these
belong in the Seq object's translate function or elsewhere. I feel that if the
user already knows that it is a non-AUG start codon, then they can replace the
first amino acid with Met rather than rely on the translate function. For
example, the CDS field in the GenBank record for Mouse Neuropeptide W
(NM_001099664) has:
/exception="alternative start codon"
/note="non-AUG (CUG) translation initiation codon".
So if the user looked at the record then they would know it would need to be
changed.
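
To make that concrete, here is a small sketch of a user-side fix driven by the
record annotation. The qualifiers dict is a hand-written stand-in for the two
qualifiers quoted above (not parsed from the actual record), the protein string
is made up, and apply_alternative_start is a hypothetical helper name.

```python
def apply_alternative_start(protein, qualifiers):
    """Replace the first residue with Met if the CDS is annotated as
    using an alternative start codon; otherwise return it unchanged."""
    exceptions = qualifiers.get("exception", [])
    if protein and "alternative start codon" in exceptions:
        return "M" + protein[1:]
    return protein


# Hand-written stand-in for the CDS qualifiers quoted above.
qualifiers = {
    "exception": ["alternative start codon"],
    "note": ["non-AUG (CUG) translation initiation codon"],
}

# Made-up translation whose first codon (CUG) was read as Leu.
print(apply_alternative_start("LKP", qualifiers))  # MKP
```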

If some form of the non-standard codons is included, I would think some variant
of Leighton's assert idea should be preferred, such as using an
assert_nonstandard argument (or just nonstandard). This would be a string, list
or tuple denoting the changes to be made, such as 'Met1' or 'M1', where the
letters are the three or single letter code of the desired amino acid and the
number is the location within the amino acid sequence to be changed. So Met1
would mean replacing the amino acid at position one with Methionine (M). But I
recognize this is not sufficient to handle other non-standard cases with stop
codons.
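
A rough sketch of how such a nonstandard argument might be parsed and applied
is below. This is only one possible reading of the idea: parse_spec and
apply_nonstandard are hypothetical names, nothing here is existing Biopython
API, and it works on the already translated string.

```python
import re

# Three letter to single letter amino acid codes, including Sec and Pyl.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O",
}
ONE_LETTER_CODES = set(THREE_TO_ONE.values())
SPEC_PATTERN = re.compile(r"([A-Za-z]{3}|[A-Za-z])(\d+)")


def parse_spec(spec):
    """Turn 'Met1' or 'M1' into a (1-based position, one letter code) pair."""
    match = SPEC_PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError("cannot parse %r" % spec)
    code, position = match.group(1), int(match.group(2))
    if len(code) == 3:
        residue = THREE_TO_ONE.get(code.capitalize())
    else:
        residue = code.upper() if code.upper() in ONE_LETTER_CODES else None
    if residue is None:
        raise ValueError("unknown amino acid code in %r" % spec)
    if position < 1:
        raise ValueError("positions are counted from one: %r" % spec)
    return position, residue


def apply_nonstandard(protein, nonstandard):
    """Apply one spec (a string) or several (a list or tuple) to protein."""
    if isinstance(nonstandard, str):
        nonstandard = [nonstandard]
    residues = list(protein)
    for spec in nonstandard:
        position, residue = parse_spec(spec)
        if position > len(residues):
            raise ValueError("%r is beyond the end of the protein" % spec)
        residues[position - 1] = residue
    return "".join(residues)


print(apply_nonstandard("LKP*", "Met1"))           # MKP*
print(apply_nonstandard("LKP*", ("M1", "Sec3")))   # MKU*
```

Because this only swaps residues after translation, it cannot express a stop
codon that should be read through, which is the limitation noted above.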

Bruce
