[EMBOSS] backtranseq

Fri Jul 22 12:52:49 UTC 2005

Nadeem Faruque wrote:

> I think we'd be better off with plain old IUPAC rather than venturing into more complex systems, or we'll end up with
> weighted matrices or even HMMs.
> The advantage of IUPAC is of course that you can plug it into most other programs.

Well .... how about this part of IUPAC:

IUBMB recommends marking unclear codons, for example in
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

"To avoid ambiguity, therefore, it is important to make it clear whenever the 
triplet YTN, for example, occurs in a sequence deduced from the occurrence of 
a leucine residue in the corresponding amino acid sequence that it does not 
include TTT or TTC as possibilities, etc. To emphasise this, it may be helpful 
to print such triplets in italics."

... we could use lowercase, rather than italics, to make this clear.
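
A minimal sketch of the lowercase idea, assuming the standard genetic code. This is an illustration only, not how backtranseq works; the function names and the choice of NNN for unrecognised residues are assumptions. Each residue becomes the IUPAC triplet covering all of its codons, printed in lowercase whenever that triplet also matches codons for some other residue.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
EXPAND = {code: bases for bases, code in IUPAC.items()}


def ambiguous_codon(aa):
    """Return the IUPAC triplet for one residue, lowercase if it over-matches."""
    codons = {c for c, a in CODON_TABLE.items() if a == aa.upper()}
    if not codons:
        return "NNN"  # assumption: unknown residues become fully ambiguous
    triplet = "".join(
        IUPAC["".join(sorted({c[i] for c in codons}))] for i in range(3)
    )
    # Every concrete codon the ambiguous triplet would match.
    matched = {"".join(p) for p in product(*(EXPAND[ch] for ch in triplet))}
    return triplet if matched == codons else triplet.lower()


def back_translate(protein):
    """Back-translate a protein string, one ambiguous triplet per residue."""
    return "".join(ambiguous_codon(aa) for aa in protein)


print(back_translate("MLF"))  # ATGytnTTY
```

Leucine comes out as ytn because YTN also matches TTT and TTC (phenylalanine), which is exactly the case the IUBMB text warns about. Serine (wsn), arginine (mgn) and stop (trr) are lowercased for the same reason.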

IUPAC also allows uncertain positions with (A,C,D) or (H.I.K.L). EMBOSS allows 
these, but after checking all occurrences in PIR it simply ignores the extra 
characters and assumes the amino acids are in the correct sequence. These are 
needed because Sanger protein sequencing determined composition but usually 
not the order of residues.
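
The "ignore the extra characters" behaviour can be sketched roughly as below. This is a hypothetical helper, not EMBOSS code; it assumes that only the parentheses and the separators inside them need to be dropped, and that the residues are kept in the order written.

```python
import re


def strip_uncertain_order(seq):
    """Drop parentheses and separators from uncertain-order segments."""
    def keep_letters(match):
        # Keep only the residue letters found inside the parentheses.
        return re.sub(r"[^A-Za-z]", "", match.group(1))

    return re.sub(r"\(([^()]*)\)", keep_letters, seq)


print(strip_uncertain_order("MK(A,C,D)LL"))  # MKACDLL
```

Characters outside parentheses are left untouched, so any gap or terminator symbols elsewhere in the sequence survive.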

I see no codes for a choice of amino acids, other than B (D or N) and Z (E or 
Q), both from amino acid composition analysis, where hydrolyzing all amide 
bonds converted N to D (asparagine to aspartate) and Q to E (glutamine to 
glutamate). Also, one IUPAC report notes that NMR data can include J for "I or 
L" as leucine and isoleucine are indistinguishable by NMR. EMBOSS so far 
ignores this code (I only discovered it today :-).
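
For illustration, the three choice codes mentioned above can be expanded with a small lookup table. This is a sketch with hypothetical names; it is not an EMBOSS data structure.

```python
# Ambiguity codes from the text: B = D or N, Z = E or Q, J = I or L.
AMBIGUOUS_RESIDUES = {"B": "DN", "Z": "EQ", "J": "IL"}


def candidate_residues(code):
    """Return the residues a one-letter code may stand for."""
    code = code.upper()
    return AMBIGUOUS_RESIDUES.get(code, code)


def expand_sequence(seq):
    """List the candidate residues at each position of a sequence."""
    return [candidate_residues(r) for r in seq]


print(expand_sequence("ABJ"))  # ['A', 'DN', 'IL']
```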

U is now officially used for selenocysteine, although many EMBOSS programs 
cannot handle U and have to use X. The only letter not used in amino acid 
sequences is O. I have seen it used in DNA sequence (CpG islands represented as 
OJ for specialised alignment scoring in one publication).