[Biojava-dev] AlphabetManager.createSymbol(...)

mark.schreiber at novartis.com
Thu Feb 15 09:31:08 UTC 2007


>Hi, The current implementation of the above for basis symbols creates a 
>symbol then caches it. I suggest that this is an undesirable behaviour.
>
>First, it is quite possible for a cross-product alphabet to have a 
>potentially huge number of symbols, and a significant fraction of these 
>may be instantiated only once, e.g. when reading through a 12-species 
>genome alignment. Caching every instantiated cross-product symbol under 
>these circumstances is very expensive in memory and also pointless.
>
>Next, the existing cache is a Map keyed on a list of Symbols. This forces 
>all caching to run off this implementation which can be inefficient for 
>certain alphabets.
>
>I propose to change the behaviour to leave all symbol implementation 
>details and caching in cross-product/basis alphabets (including 
>uniqueness checking) to the alphabet implementation. Are there any 
>implications that I may not have considered (is it OK with 
>serialisation?). Or objections? I think this change can be done without 
>breaking the API.

I think that would be fine, as long as it is well documented how to do the 
caching. Would you keep caching for core alphabets like DNA?
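
For illustration, one way an alphabet could own its cache without pinning 
once-only symbols in memory is to hold them through weak references. This 
is only a rough sketch under my own assumptions: TupleSymbol and 
WeakCachingAlphabet are made-up names, not BioJava classes, and plain 
strings stand in for the component symbols.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a cross-product symbol (not a BioJava class).
final class TupleSymbol {
    private final List<String> parts;

    TupleSymbol(List<String> parts) {
        this.parts = parts;
    }

    List<String> getParts() {
        return parts;
    }
}

// Hypothetical alphabet that owns its symbol cache and uniqueness checking.
final class WeakCachingAlphabet {
    // Values are weak, so a symbol nobody else references can be collected.
    private final Map<List<String>, WeakReference<TupleSymbol>> cache = new HashMap<>();

    // Returns the canonical symbol for these parts, creating one only if
    // no live instance is currently cached.
    synchronized TupleSymbol getSymbol(String... parts) {
        List<String> key =
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(parts)));
        WeakReference<TupleSymbol> ref = cache.get(key);
        TupleSymbol symbol = (ref == null) ? null : ref.get();
        if (symbol == null) {
            symbol = new TupleSymbol(key);
            cache.put(key, new WeakReference<>(symbol));
        }
        return symbol;
    }

    // Drops map entries whose symbol has been collected; returns how many
    // entries were removed.
    synchronized int purgeStaleEntries() {
        int before = cache.size();
        cache.values().removeIf(ref -> ref.get() == null);
        return before - cache.size();
    }
}
```

Two calls to getSymbol("a", "c") return the same instance for as long as 
something still holds it, so uniqueness is preserved for live symbols. The 
keys stay in the map until purgeStaleEntries() is called, so a real 
implementation would need to purge regularly or use a ReferenceQueue.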

I suspect it might cause problems with serialisation, but these might be 
avoidable. As long as there are unit tests for serialisation of both 
cached and uncached alphabets it should be OK. Careful attention might be 
needed for Gaps?
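
As a sketch of how uniqueness could survive serialisation, the standard 
Java readResolve() hook lets a class swap a freshly deserialised object 
for its canonical instance. The names below (CanonicalSymbol, 
SerialisationCheck) are hypothetical and the registry is deliberately 
simplistic; this is not how BioJava does it, just the general technique.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol whose uniqueness is enforced by a static registry.
final class CanonicalSymbol implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final Map<String, CanonicalSymbol> REGISTRY = new HashMap<>();

    private final String name;

    private CanonicalSymbol(String name) {
        this.name = name;
    }

    // Returns the single instance registered under this name.
    static synchronized CanonicalSymbol forName(String name) {
        CanonicalSymbol symbol = REGISTRY.get(name);
        if (symbol == null) {
            symbol = new CanonicalSymbol(name);
            REGISTRY.put(name, symbol);
        }
        return symbol;
    }

    String getName() {
        return name;
    }

    // Invoked by Java serialisation after reading; replaces the new copy
    // with the canonical instance so == comparisons keep working.
    private Object readResolve() {
        return forName(name);
    }
}

// Helper for a unit test: serialise to bytes and read back.
final class SerialisationCheck {
    static Object roundTrip(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A unit test along these lines would assert that 
SerialisationCheck.roundTrip(symbol) == symbol, once for an alphabet that 
caches and once for one that does not.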

>Another change which I would like to be considered at some future stage 
>(BJ 2.0?) is a means of dealing with really large alphabets (think 
>DNA**n). The size of the alphabet can readily exceed the limits of an int 
>and therefore a solution will require breaking our FiniteAlphabet and 
>AlphabetIndex APIs. I propose some extension that allows returning 
>results for size and index in terms of BigInteger.

A similar suggestion has been made in the past for indexing SymbolLists in 
terms of BigInteger. How practical would such a large alphabet be? Unless 
you expect it to be pretty sparse, in terms of the number of possible 
symbols that are actually seen, you might get major problems with memory.
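
To put rough numbers on it, here is a small sketch of the size arithmetic 
with BigInteger. CrossProductSize is a made-up helper, not part of 
BioJava, and I am assuming a plain 4-symbol DNA alphabet with no gaps or 
ambiguity symbols.

```java
import java.math.BigInteger;

// Hypothetical helper for sizing an n-fold cross product of an alphabet.
final class CrossProductSize {
    // Number of symbols in the n-fold cross product of an alphabet that
    // has baseSize symbols, i.e. baseSize to the power n.
    static BigInteger size(int baseSize, int n) {
        return BigInteger.valueOf(baseSize).pow(n);
    }

    // True if the size could still be reported by an int-returning
    // size() method, i.e. it does not exceed Integer.MAX_VALUE.
    static boolean fitsInInt(BigInteger size) {
        return size.bitLength() <= 31;
    }
}
```

With these assumptions DNA**15 has 1,073,741,824 symbols and still fits 
in an int, while DNA**16 has 4,294,967,296 and does not.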

- Mark

>Regards,
>David

-- 
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.
