[EMBOSS] protein sequence input

Wed Jul 28 14:13:37 UTC 2004

Hi Bobby,

> I am using EMBOSS for protein sequence
> analysis.
>
> I want the program "water" (Smith-Waterman algorithm)
> to accept the characters "J", "O" and "U", which are not
> amino acid symbols.
>
> Can I change the code? If so, in which file do I have to
> make this change so that the program accepts this
> input?

Interesting question. The sequence types are checked in ajax/ajseqtype.c.

But there is also the question of whether your sequence is really a protein.

Perhaps we should allow "alpha" as a sequence type, with its own
comparison matrices. It could get complicated (we need to check whether we
assume all non-nucleic sequences are protein, for example).

"U" is a valid protein code (for selenocysteine). "O" is used as a gap
character by some formats. "J" is not used. I have seen "O" and "J" used
in modified matrices before, though that was for DNA, to score CpG islands
differently (CG was converted to OJ and given higher match scores).
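
To illustrate the CG-to-OJ recoding mentioned above, here is a minimal
Python sketch. The function name recode_cpg is hypothetical, and I am
assuming a plain left-to-right replacement of every CG dinucleotide; the
matrices I saw may have been prepared differently.

```python
def recode_cpg(dna: str) -> str:
    """Replace each CG dinucleotide with OJ (hypothetical helper).

    Assumption: a simple, non-overlapping, left-to-right replacement is
    enough. A scoring matrix can then give O/O and J/J matches a higher
    score than ordinary C/C and G/G matches.
    """
    return dna.upper().replace("CG", "OJ")


if __name__ == "__main__":
    print(recode_cpg("ACGTCGCG"))
```

The recoded sequence is only useful together with a comparison matrix
that defines scores for O and J.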

Perhaps, as a quick solution, you could try using the protein ambiguity
codes B and Z instead of O and J? Then you could use a normal protein
sequence.
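
A rough sketch of that substitution in Python, in case it is useful. The
names SUBSTITUTES and recode_for_protein are made up for this example,
and the pairing O to B, J to Z is just one possible choice. It refuses
sequences that already contain B or Z, because the result would then be
ambiguous.

```python
# Hypothetical mapping: stand-in protein ambiguity codes for O and J.
SUBSTITUTES = {"O": "B", "J": "Z"}


def recode_for_protein(seq: str) -> str:
    """Return seq with O and J replaced by B and Z (illustrative only)."""
    seq = seq.upper()
    # B or Z already present would be indistinguishable from recoded O/J.
    clash = sorted(set(seq) & set(SUBSTITUTES.values()))
    if clash:
        raise ValueError("sequence already contains " + ", ".join(clash))
    return seq.translate(str.maketrans(SUBSTITUTES))


if __name__ == "__main__":
    print(recode_for_protein("MKOJLU"))
```

You would still need to set the B and Z rows and columns of your
comparison matrix to the scores you want for O and J.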

Hope that helps,

Peter