[Biojava-dev] AlphabetManager.createSymbol(...)

David Huen smh1008 at cam.ac.uk
Thu Feb 15 12:16:49 UTC 2007


On Feb 15 2007, mark.schreiber at novartis.com wrote:

>
>A similar suggestion has been made in the past for indexing SymbolLists in 
>terms of BigInteger. How practical would such a large alphabet be? Eg 
>unless you expect it to be pretty sparse in terms of the number of 
>possible symbols that are actually seen you might get major problems with 
>memory.
>
I think it is practical. Even a simple (AA)^10 alphabet exceeds the range 
of int, but an alignment of 10 proteins may only be, say, 1000 residues 
long, so at most 1000 symbols will ever be instantiated, and far fewer 
need to remain instantiated throughout the run. I see less point for 
SymbolLists, since it seems unlikely that any chromosome could have more 
than an int's worth of bases.
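
To put a number on the (AA)^10 case, here is a rough sketch of the 
arithmetic. It assumes a 20-letter amino acid alphabet (the real protein 
alphabet may carry a few extra symbols), and the class and method names 
are made up for illustration; none of this is BioJava API.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of BioJava.
class AlphabetSize {
    // Number of symbols in the k-fold cross product of an alphabet
    // that has `base` symbols: base^k.
    static BigInteger crossProductSize(int base, int k) {
        return BigInteger.valueOf(base).pow(k);
    }

    // True if a non-negative value can be held in a Java int.
    // Integer.MAX_VALUE (2^31 - 1) has a bit length of 31.
    static boolean fitsInInt(BigInteger n) {
        return n.signum() >= 0 && n.bitLength() <= 31;
    }
}
```

With these assumptions, crossProductSize(20, 10) is 10240000000000, 
against an int maximum of 2147483647, so fitsInInt returns false. That 
particular size would still fit in a long, but wider alignments or 
larger component alphabets quickly exceed that as well.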

The main reason I need these huge alphabets is for 1-D HMMs that run over 
genome alignments. I also hope to represent symbols in these alphabets 
internally by the BigInteger value of their alphabet index.
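
One possible way to get such an index is a plain mixed-radix encoding of 
the component symbol indices. The sketch below assumes every column of 
the alignment uses the same component alphabet of size `base`, and that 
each component symbol already has an int index; the class and method 
names are hypothetical and this is not how BioJava does it today.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of BioJava.
class CrossProductIndex {
    // Fold the component indices into one number, most significant first:
    // index = ((c0 * base + c1) * base + c2) ...
    static BigInteger encode(int[] componentIndices, int base) {
        BigInteger b = BigInteger.valueOf(base);
        BigInteger index = BigInteger.ZERO;
        for (int c : componentIndices) {
            if (c < 0 || c >= base) {
                throw new IllegalArgumentException("component index out of range: " + c);
            }
            index = index.multiply(b).add(BigInteger.valueOf(c));
        }
        return index;
    }

    // Recover the component indices by repeated division, filling the
    // array from the least significant position backwards.
    static int[] decode(BigInteger index, int base, int length) {
        BigInteger b = BigInteger.valueOf(base);
        int[] out = new int[length];
        for (int i = length - 1; i >= 0; i--) {
            BigInteger[] qr = index.divideAndRemainder(b);
            out[i] = qr[1].intValue();   // remainder is always < base
            index = qr[0];
        }
        return out;
    }
}
```

Under this scheme the largest index for ten components over a 20-symbol 
alphabet is 20^10 - 1, and decode(encode(x, base), base, x.length) gives 
back x.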

Incidentally, the SparseCrossProductAlphabet appeared to cache every 
symbol it was ever asked for, so I have changed it to use a 
WeakValueHashMap internally.
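
For anyone unfamiliar with the idea, a weak-value cache looks roughly 
like the sketch below. This is a stand-alone illustration built on 
java.lang.ref.WeakReference, not the actual WeakValueHashMap code or the 
change I made; the class name and the getOrCreate method are made up.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a cache that does not keep its values alive.
class WeakValueCache<K, V> {
    private final Map<K, WeakReference<V>> map = new HashMap<>();

    // Return the cached value if it is still reachable elsewhere,
    // otherwise build a new one and remember it weakly.
    public synchronized V getOrCreate(K key, Function<K, V> factory) {
        WeakReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = factory.apply(key);
            map.put(key, new WeakReference<>(value));
        }
        return value;
    }
}
```

As long as something else holds a symbol, repeated lookups return the 
same instance; once nothing does, the garbage collector is free to 
reclaim it and the next lookup rebuilds it. This sketch never removes 
the stale keys themselves, so a fuller version would need a 
ReferenceQueue or a periodic sweep.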

Regards,
David


-- 
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.