[Biojava-dev] AlphabetManager.createSymbol(...)

Tue Feb 13 13:47:03 UTC 2007

Hi,

The current implementation of AlphabetManager.createSymbol(...) for basis
symbols creates a symbol and then caches it. I suggest that this behaviour
is undesirable.

First, it is quite possible for a cross-product alphabet to have a 
potentially huge number of symbols, and a significant fraction of these 
may be instantiated only once, e.g. when reading through a 12-species 
genome alignment. Caching every instantiated cross-product symbol under 
these circumstances is very expensive in memory and also pointless.

Next, the existing cache is a Map keyed on a list of Symbols. This forces 
all caching to go through this one implementation, which can be 
inefficient for certain alphabets.
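
To illustrate the kind of alternative an alphabet might prefer, here is a 
minimal sketch, assuming the components of a cross-product symbol can each 
be given a small integer index. It packs those indices into a single long 
that could serve as a cache key in place of a List. The class and method 
names are hypothetical and are not part of BioJava.

```java
import java.util.Arrays;

// Hypothetical sketch: none of these names are BioJava API.
class PackedKeySketch {

    // Pack per-position component indices (each in [0, base)) into one long,
    // treating them as the digits of a base-`base` number.
    static long pack(int[] indices, int base) {
        long key = 0;
        for (int idx : indices) {
            if (idx < 0 || idx >= base) {
                throw new IllegalArgumentException("index out of range: " + idx);
            }
            // The *Exact methods throw ArithmeticException instead of
            // silently overflowing the long.
            key = Math.addExact(Math.multiplyExact(key, (long) base), (long) idx);
        }
        return key;
    }

    // Recover the component indices from a packed key.
    static int[] unpack(long key, int base, int length) {
        int[] indices = new int[length];
        for (int i = length - 1; i >= 0; i--) {
            indices[i] = (int) (key % base);
            key /= base;
        }
        return indices;
    }

    public static void main(String[] args) {
        int[] column = {0, 1, 2, 3};   // e.g. assumed indices for four bases
        long key = pack(column, 4);
        System.out.println(key);
        System.out.println(Arrays.toString(unpack(key, 4, column.length)));
    }
}
```

A primitive key like this avoids allocating and hashing a List for every 
lookup, but it only works while the packed value fits in a long, which is 
one reason the choice is better left to the individual alphabet.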

I propose to change the behaviour so that all symbol implementation 
details and caching for cross-product/basis alphabets (including 
uniqueness checking) are left to the alphabet implementation. Are there 
any implications that I may not have considered (e.g. is it OK with 
serialisation)? Or any objections? I think this change can be done without 
breaking the API.
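
As a rough sketch of what leaving the decision to the alphabet could look 
like, the following uses entirely hypothetical types (Sym, SymbolSource, 
CachingSource, TransientSource); none of them are BioJava classes. One 
source keeps a cache and hands back the same instance each time, the other 
creates a fresh symbol on every request and keeps nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Every type here is a hypothetical stand-in, not a BioJava class.
class AlphabetOwnedCachingSketch {

    // Minimal symbol with value equality, so uncached instances
    // built from the same parts still compare equal.
    static final class Sym {
        final List<String> parts;
        Sym(List<String> parts) { this.parts = new ArrayList<>(parts); }
        @Override public boolean equals(Object o) {
            return o instanceof Sym && ((Sym) o).parts.equals(parts);
        }
        @Override public int hashCode() { return parts.hashCode(); }
    }

    // The alphabet, not a central manager, decides how symbols are produced.
    interface SymbolSource {
        Sym symbolFor(List<String> parts);
    }

    // Suits small alphabets: one shared instance per distinct symbol.
    static final class CachingSource implements SymbolSource {
        private final Map<List<String>, Sym> cache = new HashMap<>();
        @Override public Sym symbolFor(List<String> parts) {
            // Copy the key so later changes to the caller's list cannot corrupt the map.
            return cache.computeIfAbsent(new ArrayList<>(parts), Sym::new);
        }
        int cachedCount() { return cache.size(); }
    }

    // Suits huge alphabets whose symbols are mostly seen once: nothing is retained.
    static final class TransientSource implements SymbolSource {
        @Override public Sym symbolFor(List<String> parts) { return new Sym(parts); }
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("a", "c", "g");
        SymbolSource cached = new CachingSource();
        SymbolSource fresh = new TransientSource();
        System.out.println(cached.symbolFor(column) == cached.symbolFor(column));
        System.out.println(fresh.symbolFor(column).equals(fresh.symbolFor(column)));
    }
}
```

In this sketch the uncached source only stays consistent because Sym 
defines equals() and hashCode(); any caller that compares symbols by 
identity would see two different objects, which may be one of the 
implications worth checking.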

Another change which I would like to be considered at some future stage (BJ 
2.0?) is a means of dealing with really large alphabets (think DNA**n). The 
size of such an alphabet can readily exceed the limits of an int, and 
therefore a solution will require breaking our FiniteAlphabet and 
AlphabetIndex APIs. I propose some extension that allows size and index 
results to be returned as BigInteger.
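
To make the overflow concrete, here is a small sketch, assuming a base 
alphabet of four symbols and a cross-product of order n. The class and its 
methods are hypothetical and do not reflect any existing or planned 
BioJava API; only java.math.BigInteger is real.

```java
import java.math.BigInteger;

// Hypothetical sketch: not the FiniteAlphabet or AlphabetIndex API.
class LargeAlphabetSketch {

    // Number of symbols in a cross-product of `order` copies of an
    // alphabet with `baseSize` symbols.
    static BigInteger size(int baseSize, int order) {
        return BigInteger.valueOf(baseSize).pow(order);
    }

    // Index of a cross-product symbol, reading its component indices
    // as the digits of a base-`baseSize` number.
    static BigInteger indexOf(int[] componentIndices, int baseSize) {
        BigInteger base = BigInteger.valueOf(baseSize);
        BigInteger index = BigInteger.ZERO;
        for (int c : componentIndices) {
            if (c < 0 || c >= baseSize) {
                throw new IllegalArgumentException("index out of range: " + c);
            }
            index = index.multiply(base).add(BigInteger.valueOf(c));
        }
        return index;
    }

    // True if a non-negative value can be held in a Java int.
    static boolean fitsInInt(BigInteger n) {
        return n.signum() >= 0 && n.bitLength() < 32;
    }

    public static void main(String[] args) {
        System.out.println(size(4, 15) + " fits in int: " + fitsInInt(size(4, 15)));
        System.out.println(size(4, 16) + " fits in int: " + fitsInInt(size(4, 16)));
    }
}
```

With four base symbols, order 15 gives 1,073,741,824 symbols, which still 
fits in an int, while order 16 gives 4,294,967,296, which does not.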

Regards,
David

-- 
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.