[EMBOSS] backtranseq
Peter Rice
pmr at ebi.ac.uk
Fri Jul 22 12:52:49 UTC 2005
Nadeem Faruque wrote:
> I think we'd be better off with plain old IUPAC rather than venturing into more complex systems or we'll end up with
> weighted matrices or even HMM's.
> The advantage of IUPAC is of course that you can plug it into most other programs.
Well .... how about this part of IUPAC:
IUBMB recommends marking unclear codons, for example in
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
"To avoid ambiguity, therefore, it is important to make it clear whenever the
triplet YTN, for example, occurs in a sequence deduced from the occurrence of
a leucine residue in the corresponding amino acid sequence that it does not
include TTT or TTC as possibilities, etc. To emphasise this, it may be helpful
to print such triplets in italics."
... we could use lowercase, rather than italics, to make this clear.
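
As a rough sketch of the lowercase idea (Python, not EMBOSS code; the function names expand and mark_unclear are made up for illustration), this expands an IUPAC triplet into concrete codons and lowercases it when it covers codons outside the intended amino acid:

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# The six leucine codons in the standard genetic code.
LEUCINE_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}


def expand(triplet):
    """Return the set of concrete codons an IUPAC triplet stands for."""
    choices = (IUPAC_DNA[base] for base in triplet.upper())
    return {"".join(p) for p in product(*choices)}


def mark_unclear(triplet, true_codons):
    """Hypothetical helper: lowercase a triplet that over-covers true_codons."""
    if expand(triplet) - true_codons:
        return triplet.lower()
    return triplet.upper()


extras = sorted(expand("YTN") - LEUCINE_CODONS)  # codons YTN wrongly admits
marked = mark_unclear("YTN", LEUCINE_CODONS)
print(extras, marked)
```

For leucine, YTN picks up the two phenylalanine codons, so it comes back lowercased, while an exact triplet such as CTN stays uppercase.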
IUPAC also allows uncertain positions with (A,C,D) or (H,I,K,L). EMBOSS allows
these, but after checking all occurrences in PIR it simply ignores the extra
characters and assumes the amino acids are in the correct sequence. These are
needed because Sanger protein sequencing determined composition but usually
not the order of residues.
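
Roughly, that behaviour amounts to the following (a Python sketch only; strip_uncertain is a hypothetical name and this is not how EMBOSS itself is written). It assumes parenthesised, comma-separated groups as in the examples above:

```python
# Characters that only mark an uncertain-order group, not residues.
_GROUP_CHARS = str.maketrans("", "", "(),")


def strip_uncertain(sequence):
    """Drop the grouping characters and keep residues in the order given."""
    return sequence.translate(_GROUP_CHARS)


print(strip_uncertain("MK(A,C,D)LV"))
```

The residues inside the group are kept in the order written, which is exactly the assumption described above.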
I see no codes for a choice of amino acids, other than B (D or N) and Z (E or
Q), both from amino acid sequence composition, where hydrolyzing all amide
bonds converted N to D (asparagine to aspartate) and Q to E (glutamine to
glutamate). Also, one IUPAC report notes that NMR data can include J for "I or
L" as leucine and isoleucine are indistinguishable by NMR. EMBOSS so far
ignores this code (I only discovered it today :-).
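
To see what these codes would mean for back-translation, here is a small Python sketch (illustrative only; backtranslate_ambiguous and consensus are made-up names, and the codon sets are from the standard genetic code). It builds the per-position IUPAC consensus for each ambiguous residue and reports whether that triplet is exact:

```python
from itertools import product

IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
TO_CODE = {frozenset(bases): code for code, bases in IUPAC_DNA.items()}

# Ambiguous amino acid codes mentioned above.
AMBIGUOUS_AA = {"B": "DN", "Z": "EQ", "J": "IL"}

# Standard genetic code, only the residues needed here.
CODONS = {
    "D": {"GAT", "GAC"}, "N": {"AAT", "AAC"},
    "E": {"GAA", "GAG"}, "Q": {"CAA", "CAG"},
    "I": {"ATT", "ATC", "ATA"},
    "L": {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"},
}


def expand(triplet):
    """Return the set of concrete codons an IUPAC triplet stands for."""
    return {"".join(p) for p in product(*(IUPAC_DNA[b] for b in triplet))}


def consensus(codons):
    """Per-position IUPAC code covering every codon in the set."""
    return "".join(TO_CODE[frozenset(c[i] for c in codons)] for i in range(3))


def backtranslate_ambiguous(code):
    """Return (triplet, exact) for an ambiguous amino acid code."""
    codons = set().union(*(CODONS[aa] for aa in AMBIGUOUS_AA[code]))
    triplet = consensus(codons)
    return triplet, expand(triplet) == codons


for aa in "BZJ":
    print(aa, backtranslate_ambiguous(aa))
```

B and Z come out as exact triplets (RAY and SAR), whereas J gives HTN, which also admits ATG, TTT and TTC, so it would be a candidate for the lowercase marking.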
U is now officially used for selenocysteine, although many EMBOSS programs
cannot handle U and have to use X. The only character not used in amino acid
sequence is O. I have seen it used in DNA sequence (CpG islands represented as
OJ for specialised alignment scoring in one publication).