[Biojava-dev] Suggestion for Canonical Symbols
Thomas Down
td2@sanger.ac.uk
Mon, 9 Dec 2002 00:00:20 +0000
On Mon, Dec 09, 2002 at 11:59:01AM +1300, Schreiber, Mark wrote:
> Hi -
>
> If you translate an RNA SymbolList into Protein the Symbols in the
> protein SymbolList come from the alphabet referenced by the
> ProteinTools.getTAlphabet.
>
> The Symbols from the TAlphabet are not canonical with the Symbols from
> the other protein Alphabet. This has led to some very surprising bugs
> in some stuff we were developing. Given that Integer Symbols are now
> canonical even if they come from IntegerAlphabet or one of the
> Integer.SubAlphabets could the same happen for the protein Alphabets?
*sigh*
That was actually the original behaviour. I broke it
(deliberately) a few weeks ago when fixing the knotty
question of serializing ambiguous symbols, so now you know
who to blame. At the time, requiring that all well-known
symbols should be scoped by Alphabet provided a sane way
of cleaning up the serialization code without having to write
totally new Symbol and Alphabet implementations for all the
well-known cases. At least in the Protein/protein-term
case it probably does make sense to fix this. I shall
ponder -- all suggestions welcome.
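To make the difference concrete, here is a minimal sketch of alphabet-scoped
versus canonical symbols. None of these classes are BioJava's: Sym, Alpha and
the shared pool are hypothetical stand-ins, and the only assumption carried
over is that symbols are compared by identity rather than by name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CanonicalSymbolSketch {

    // Hypothetical stand-in for a symbol. It does not override equals(),
    // so two instances with the same name are still different symbols.
    static final class Sym {
        final String name;
        Sym(String name) { this.name = name; }
    }

    // Hypothetical stand-in for a finite alphabet: a named set of symbols.
    static final class Alpha {
        private final Map<String, Sym> symbols = new HashMap<>();

        // 'source' decides where each symbol instance comes from.
        Alpha(Function<String, Sym> source, String... names) {
            for (String n : names) {
                symbols.put(n, source.apply(n));
            }
        }

        Sym get(String name) { return symbols.get(name); }
    }

    // Alphabet-scoped: each alphabet builds its own symbol instances.
    static boolean scopedSymbolsMatch() {
        Alpha protein = new Alpha(Sym::new, "ALA", "GLY");
        Alpha proteinTerm = new Alpha(Sym::new, "ALA", "GLY", "TER");
        return protein.get("ALA") == proteinTerm.get("ALA");
    }

    // Canonical: both alphabets draw their symbols from one shared pool.
    static boolean canonicalSymbolsMatch() {
        Map<String, Sym> pool = new HashMap<>();
        Function<String, Sym> shared = n -> pool.computeIfAbsent(n, Sym::new);
        Alpha protein = new Alpha(shared, "ALA", "GLY");
        Alpha proteinTerm = new Alpha(shared, "ALA", "GLY", "TER");
        return protein.get("ALA") == proteinTerm.get("ALA");
    }

    public static void main(String[] args) {
        System.out.println("scoped:    " + scopedSymbolsMatch());
        System.out.println("canonical: " + canonicalSymbolsMatch());
    }
}
```

In the scoped case the two ALA symbols are distinct objects, so the comparison
is false; with the shared pool it is true. The first case is the kind of
mismatch that can surprise code mixing symbols from the two protein alphabets.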
The division between protein and protein-term is really
rather artificial. As far as I can tell, the termination
symbol is a bit like the gap symbol, in that it never occurs
in "biologically real" sequences, but is a useful convenience
for computation. Maybe we'll be able to build on that idea for
BJ2 and get rid of the annoying distinction.
Thomas.