[Biojava-l] RE: Bug in HashedAlphabetIndex??

Matthew Pocock mrp@sanger.ac.uk
Wed, 07 Mar 2001 15:59:42 +0000


Hi Mark,

I've fixed this on the main trunk. Thomas, could you port this to the 
1.1 branch?

The two issues this brings up are:

a) I think that the SymbolList FiniteAlphabet.symbols() is unnecessary. 
If you want to iterate over an alphabet or find its size, you can just 
use the methods in FiniteAlphabet. If you wish to impose some ordering 
on the FiniteAlphabet, then you can use an AlphabetIndexer object 
obtainable via AlphabetManager.getAlphabetIndex(alpha). I have 
deprecated symbols(), but I think it should remain un-deprecated on 
the release branch.
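
To illustrate what "imposing an ordering" through a separate index object 
means, here is a minimal sketch in plain Java. It does not use the BioJava 
classes: SimpleIndex and its methods indexFor/symbolFor are hypothetical 
names made up for this example, and sorting by a Comparator is just one 
assumed way of choosing the order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical helper, not part of BioJava: assigns each member of an
// unordered set a stable integer index, so the set itself needs no order.
final class SimpleIndex<S> {
    private final List<S> byIndex;
    private final Map<S, Integer> bySymbol = new HashMap<>();

    SimpleIndex(Set<S> symbols, Comparator<? super S> order) {
        List<S> sorted = new ArrayList<>(symbols);
        sorted.sort(order);                       // the index decides the order
        for (int i = 0; i < sorted.size(); i++) {
            bySymbol.put(sorted.get(i), i);
        }
        byIndex = Collections.unmodifiableList(sorted);
    }

    // Position of a symbol in the imposed ordering.
    int indexFor(S symbol) {
        Integer i = bySymbol.get(symbol);
        if (i == null) {
            throw new NoSuchElementException("Not in alphabet: " + symbol);
        }
        return i;
    }

    // Symbol at a given position in the imposed ordering.
    S symbolFor(int index) {
        return byIndex.get(index);
    }

    int size() {
        return byIndex.size();
    }
}
```

The point of the design is that the alphabet stays a plain set, and any 
code that needs array-style access asks the index for positions instead.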

b) The default distribution objects construct a distribution with as 
many parameters as there are symbols in your alphabet, and one fewer 
free parameter (as the probabilities must sum to 1). I have a gut 
feeling that building a probability distribution over a very large 
number of symbols (e.g. anything larger than DNA hexamers) is probably 
silly (although I've now tested it for dna^7), as you won't have enough 
data to estimate that many parameters. This may mandate the use of 
custom Distribution implementations that do clever data-smoothing, or 
that have far fewer parameters (e.g. make dna^7 using a function of a 
simple dna^2 matrix).
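
To put numbers on this, here is a rough sketch in plain Java (no BioJava 
classes). It counts the free parameters of a full distribution over DNA 
k-mers, and shows one possible reading of "a function of a simple dna^2 
matrix": a first-order Markov chain in which each base depends only on 
the previous one. That reading, and the names KmerSketch, freeParameters 
and kmerProbability, are assumptions made for this example only.

```java
import java.util.Arrays;

// Hypothetical illustration, not a BioJava Distribution implementation.
final class KmerSketch {

    // Free parameters of a full distribution over alphabetSize^k symbols:
    // one probability per symbol, minus one for the sum-to-1 constraint.
    static long freeParameters(int alphabetSize, int k) {
        long symbols = 1;
        for (int i = 0; i < k; i++) {
            symbols *= alphabetSize;
        }
        return symbols - 1;
    }

    // Probability of a k-mer under a first-order Markov chain.
    // start[a]   = P(first base is a)
    // cond[a][b] = P(next base is b | previous base is a)
    // Bases are encoded as ints 0..3.
    static double kmerProbability(int[] kmer, double[] start, double[][] cond) {
        if (kmer.length == 0) {
            throw new IllegalArgumentException("k-mer must not be empty");
        }
        double p = start[kmer[0]];
        for (int i = 1; i < kmer.length; i++) {
            p *= cond[kmer[i - 1]][kmer[i]];
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(freeParameters(4, 6)); // 4095 for hexamers
        System.out.println(freeParameters(4, 7)); // 16383 for dna^7

        // Uniform toy model, purely for demonstration.
        double[] start = {0.25, 0.25, 0.25, 0.25};
        double[][] cond = new double[4][4];
        for (double[] row : cond) {
            Arrays.fill(row, 0.25);
        }
        System.out.println(kmerProbability(new int[] {0, 1, 2, 3, 0, 1, 2}, start, cond));
    }
}
```

The chain needs only 15 free parameters (3 for the start distribution 
plus 3 for each of the 4 rows of the conditional matrix), against 16383 
for the full dna^7 table, so it can be estimated from far less data at 
the cost of ignoring longer-range dependencies.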

Anyway, thanks for finding the bug. Feel free to test the new code to 
destruction and find more interesting behaviours.

Matthew