[Biopython-dev] [Bug 2381] translate and transcribe methods for the Seq object (in Bio.Seq)

Thu Nov 6 15:27:07 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2381

------- Comment #45 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-06 10:27 EST -------
(In reply to comment #43)
> (In reply to comment #39)
> > I would be happy with EITHER of these options, as both can be used to
> > translate a complete coding sequence:
> > 
> > (1) the "init" argument (under another name, maybe "cds_start"?)
> > illustrated in attachment 1032.  This would check the start
> > codon is valid AND translate it as a methionine.
> > 
> > (2) the "complete_cds" argument (perhaps under another name, maybe "cds"?)
> > illustrated in this patch.  This would check the start codon is valid AND
> > translate it as a methionine AND check there are a whole number of codons
> > AND check it ends with a stop codon AND check there are no extra in-frame
> > stop codons.
> > 
> 
> 
> I support (1) but strongly disagree with (2) because 'cds' refers to
> a complete DNA sequence not just if the sequence starts with M.
> http://www.yeastgenome.org/help/glossary.html
> "CDS:    CoDing Sequence, region of nucleotides that corresponds to the
> sequence of amino acids in the predicted protein. The CDS includes start and
> stop codons, therefore coding sequences begin with an "ATG" and end with a
> stop codon. In SGD, unexpressed sequences, including the 5'-UTR, the 3'-UTR,
> introns, or bases not expressed due to frameshifting, are not included within
> a CDS. Note that the CDS does not correspond to the actual mRNA sequence."

Starting with that definition but being aware of atypical start codons gives:

"The CDS includes start and stop codons, therefore coding sequences begin with
an "ATG" [or other valid start codon] and end with a stop codon."

This then fits exactly with what I'm doing in the "complete_cds" option
(attachment 1040).  So why the disagreement?
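
For illustration, here is a minimal sketch of the five checks listed for
option (2).  It is not the code from attachment 1040: the function name
translate_complete_cds is hypothetical, the codon table is hard-coded to the
standard genetic code (NCBI table 1), and dropping the trailing stop from the
returned protein is an assumption.

```python
from itertools import product

# Standard genetic code (NCBI table 1), built from the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Start codons listed for NCBI table 1 (ATG plus the atypical TTG and CTG).
START_CODONS = {"ATG", "TTG", "CTG"}


def translate_complete_cds(dna):
    """Hypothetical helper: validate a complete CDS and translate it.

    Assumes an unambiguous DNA string; the trailing stop is not returned.
    """
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence is not a whole number of codons")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if len(codons) < 2:
        raise ValueError("need at least a start codon and a stop codon")
    if codons[0] not in START_CODONS:
        raise ValueError("first codon %r is not a valid start codon" % codons[0])
    if CODON_TABLE[codons[-1]] != "*":
        raise ValueError("final codon %r is not a stop codon" % codons[-1])
    body = "".join(CODON_TABLE[codon] for codon in codons[1:-1])
    if "*" in body:
        raise ValueError("extra in-frame stop codon found")
    # A valid start codon is always translated as methionine.
    return "M" + body
```

With this sketch, "TTGGCCTAA" gives "MA" rather than "LA", because the
atypical start codon TTG is read as methionine at the first position.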

> However, I do like being able to obtain the translation of the actual
> CDS - just not here.

Back in comment 11, I previously mooted having separate methods like
translate_to_stop, and translate_cds - but we currently seem to be leaning
towards one method with some options.

> I do not support the name 'init' because of reasons discussed. 

I think that is settled: "init" is too ambiguous.

> I do not support the name 'cds_start' because of the DNA interpretation and
> that many Genbank records include the upstream and downstream non-coding
> regions. In such cases, I would have to find the actual start codon, then I
> might as well do the translation after that start codon than rely on a check
> that might be wrong.

If your sequence might include upstream and downstream non-coding regions,
then you shouldn't be trying to use the "init"/"cds_start" option (or the
"complete_cds" option).  By the nature of your uncertain dataset, you'll have
to do some extra work to find the start/stop.  I don't see how this is an
argument against providing an option useful for when you do know where the
CDS starts (or do already have the CDS).
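
As a rough idea of what that extra work could look like, here is a naive
sketch that scans for the first start codon followed by an in-frame stop.
The name find_first_orf is hypothetical, only the forward strand and the
standard stop codons are considered, and real records would normally be
handled using their annotated CDS features instead.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(dna, start_codon="ATG"):
    """Hypothetical helper: return (start, end) of the first start codon
    that is followed by an in-frame stop codon, or None if there is none.

    The slice dna[start:end] includes both the start and the stop codon.
    """
    dna = dna.upper()
    start = dna.find(start_codon)
    while start != -1:
        # Walk codon by codon from the position just after the start codon.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return start, i + 3
        # No in-frame stop for this candidate; try the next start codon.
        start = dna.find(start_codon, start + 1)
    return None
```

The resulting slice is the kind of input the "complete_cds" style check is
meant for; anything upstream or downstream of it is simply left out.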

> Perhaps some variant of:
> a) Similar cases in Python:
> has_met or has_met1
> get_met or get_met1
> b) More direct meaning:
> starts_with_methionine, starts_with_met, starts_with_m
> 

I'd been avoiding names with methionine in them, preferring to focus on
initiation or start codon based names.

I guess "starts_with_met" is OK.  Or maybe "start_met"?
