[Biojava-dev] AlphabetManager.createSymbol(...)
David Huen
smh1008 at cam.ac.uk
Tue Feb 13 13:47:03 UTC 2007
Hi,
The current implementation of the above for basis symbols creates a
symbol and then caches it. I suggest that this is undesirable behaviour.
First, it is quite possible for a cross-product alphabet to have a
potentially huge number of symbols, a significant fraction of which are
instantiated only once, e.g. when reading through a 12-species genome
alignment. Caching every instantiated cross-product symbol under these
circumstances is very expensive in memory and also pointless.
Next, the existing cache is a Map keyed on a list of Symbols. This forces
all caching to go through this one implementation, which can be inefficient
for certain alphabets.
I propose to change the behaviour so that all symbol implementation
details and caching for cross-product/basis alphabets (including uniqueness
checking) are left to the alphabet implementation. Are there any
implications that I may not have considered (e.g. is it OK with
serialisation)? Any objections? I think this change can be done without
breaking the API.
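
To make the proposal concrete, here is a minimal sketch of an alphabet that owns its own uniqueness checking. It is not BioJava code: TupleSymbol, TupleAlphabet and getSymbol are hypothetical names, components are represented as plain int indices, and the policy shown (a packed long key with weakly held values) is just one choice an alphabet implementation might make.

```java
import java.lang.ref.WeakReference;
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cross-product (basis) symbol; not a BioJava class.
final class TupleSymbol {
    private final int[] components; // one component-alphabet index per position

    TupleSymbol(int[] components) {
        this.components = components.clone();
    }

    int componentAt(int position) {
        return components[position];
    }
}

// Hypothetical cross-product alphabet that owns its own uniqueness policy.
final class TupleAlphabet {
    private final int radix; // size of each component alphabet, e.g. 4 for DNA
    private final int order; // number of positions, e.g. 12 aligned species
    // Keyed on a packed long rather than a List of symbols. Values are held
    // weakly, so a symbol that nothing else references can be collected.
    private final Map<Long, WeakReference<TupleSymbol>> live = new HashMap<>();

    TupleAlphabet(int radix, int order) {
        if (radix < 2 || order < 1) {
            throw new IllegalArgumentException("radix >= 2 and order >= 1 required");
        }
        // Conservative check that every packed key fits in a non-negative long.
        if (BigInteger.valueOf(radix).pow(order).bitLength() > 63) {
            throw new IllegalArgumentException("alphabet too large for a long key");
        }
        this.radix = radix;
        this.order = order;
    }

    // Pack the component indices into a single base-radix number.
    private long pack(int[] components) {
        long key = 0;
        for (int c : components) {
            if (c < 0 || c >= radix) {
                throw new IllegalArgumentException("component out of range: " + c);
            }
            key = key * radix + c;
        }
        return key;
    }

    // Returns the canonical symbol for these components while one is still
    // strongly reachable elsewhere; otherwise creates a fresh one.
    synchronized TupleSymbol getSymbol(int... components) {
        if (components.length != order) {
            throw new IllegalArgumentException("expected " + order + " components");
        }
        long key = pack(components);
        WeakReference<TupleSymbol> ref = live.get(key);
        TupleSymbol symbol = (ref == null) ? null : ref.get();
        if (symbol == null) {
            symbol = new TupleSymbol(components);
            // Note: cleared entries are only overwritten, never purged, in
            // this sketch; a ReferenceQueue would be needed to remove them.
            live.put(key, new WeakReference<>(symbol));
        }
        return symbol;
    }
}
```

Two calls with the same components return the same instance for as long as the caller holds on to it, so identity comparison still works, but a symbol seen once in an alignment column does not stay pinned in memory by the cache.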
Another change which I would like to be considered at some future stage (BJ
2.0?) is a means of dealing with really large alphabets (think DNA**n). The
size of the alphabet can readily exceed the limits of an int and therefore
a solution will require breaking our FiniteAlphabet and AlphabetIndex APIs.
I propose some extension that allows returning results for size and index
in terms of BigInteger.
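
As a rough illustration of what a BigInteger-based size and index could look like, here is a small self-contained sketch. BigTupleIndex and its methods are hypothetical and are not part of the FiniteAlphabet or AlphabetIndex APIs; symbols are again reduced to arrays of int component indices.

```java
import java.math.BigInteger;

// Hypothetical index over an alphabet of the form base**order (e.g. DNA**n).
final class BigTupleIndex {
    private final BigInteger radix; // size of the component alphabet
    private final int order;        // number of positions in each tuple

    BigTupleIndex(int radix, int order) {
        if (radix < 2 || order < 1) {
            throw new IllegalArgumentException("radix >= 2 and order >= 1 required");
        }
        this.radix = BigInteger.valueOf(radix);
        this.order = order;
    }

    // Number of basis symbols: radix^order, with no int overflow.
    BigInteger size() {
        return radix.pow(order);
    }

    // Index of a tuple, reading its components as digits in base radix.
    BigInteger indexOf(int... components) {
        if (components.length != order) {
            throw new IllegalArgumentException("expected " + order + " components");
        }
        BigInteger index = BigInteger.ZERO;
        for (int c : components) {
            if (c < 0 || BigInteger.valueOf(c).compareTo(radix) >= 0) {
                throw new IllegalArgumentException("component out of range: " + c);
            }
            index = index.multiply(radix).add(BigInteger.valueOf(c));
        }
        return index;
    }

    // Inverse of indexOf: recover the component indices from an index.
    int[] componentsAt(BigInteger index) {
        if (index.signum() < 0 || index.compareTo(size()) >= 0) {
            throw new IndexOutOfBoundsException("index out of range: " + index);
        }
        int[] components = new int[order];
        BigInteger rest = index;
        for (int i = order - 1; i >= 0; i--) {
            BigInteger[] quotientAndRemainder = rest.divideAndRemainder(radix);
            components[i] = quotientAndRemainder[1].intValue();
            rest = quotientAndRemainder[0];
        }
        return components;
    }
}
```

With a 4-letter alphabet, order 16 already gives a size of 4^16 = 4294967296, which is past Integer.MAX_VALUE (2147483647), so an int-returning size() cannot represent it.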
Regards,
David
--
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.