[Biojava-l] RE: Bug in HashedAlphabetIndex??
Matthew Pocock
mrp@sanger.ac.uk
Wed, 07 Mar 2001 15:59:42 +0000
Hi Mark,
I've fixed this on the main trunk. Thomas, could you port this to the
1.1 branch?
The two issues this brings up are:
a) I think that the SymbolList FiniteAlphabet.symbols() method is
unnecessary. If you want to iterate over an alphabet or find its size,
you can just use the methods in FiniteAlphabet. If you wish to impose
some ordering on the FiniteAlphabet, then you can use an AlphabetIndexer
object obtainable via AlphabetManager.getAlphabetIndex(alpha). I have
deprecated symbols() on the trunk, but I think it should remain
un-deprecated on the release branch.
b) The default distribution objects construct a distribution with as
many parameters as there are symbols in your alphabet, and one fewer
free parameters (as they must sum to 1). I have a gut feeling that
building a probability distribution over a very large number of symbols
(e.g. anything larger than DNA hexamers) is probably silly (although
I've now tested it for dna^7), as you won't have enough data to estimate
that many parameters. This may mandate the use of custom Distribution
implementations that do clever data-smoothing, or that have far fewer
parameters (e.g. make dna^7 using a function of a simple dna^2 matrix).
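
To make point b) concrete, here is a rough sketch in plain Java. It does
not use the BioJava API at all: the class DnaWordModel and its methods
are hypothetical names made up for illustration. It counts the free
parameters of a full distribution over dna^k, and shows one possible
"function of a dna^2 matrix", namely a first-order Markov chain derived
from a dinucleotide joint-probability matrix. That choice of function is
an assumption; other ways of factoring the distribution would do as well.

```java
// Hypothetical illustration only; not part of BioJava.
class DnaWordModel {
    // Number of symbols in the cross-product alphabet dna^k, i.e. 4^k.
    static long symbolCount(int k) {
        long n = 1;
        for (int i = 0; i < k; i++) {
            n *= 4;
        }
        return n;
    }

    // A full distribution has one weight per symbol, and the weights
    // must sum to 1, so there is one fewer free parameter than symbols.
    static long fullFreeParameters(int k) {
        return symbolCount(k) - 1;
    }

    private static double sum(double[] row) {
        double s = 0.0;
        for (double v : row) {
            s += v;
        }
        return s;
    }

    // Probability of a word (bases coded 0..3) under a first-order chain.
    // pair[a][b] is the joint probability of base a followed by base b;
    // all 16 entries are assumed to sum to 1 (15 free parameters).
    static double wordProbability(double[][] pair, int[] word) {
        // Marginal probability of the first base is its row sum.
        double p = sum(pair[word[0]]);
        for (int i = 1; i < word.length; i++) {
            double rowTotal = sum(pair[word[i - 1]]);
            if (rowTotal == 0.0) {
                return 0.0; // previous base never occurs, so neither does the word
            }
            // Conditional probability P(word[i] | word[i - 1]).
            p *= pair[word[i - 1]][word[i]] / rowTotal;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println("dna^7 symbols: " + symbolCount(7));
        System.out.println("dna^7 free parameters: " + fullFreeParameters(7));
        System.out.println("dna^2 free parameters: " + fullFreeParameters(2));
    }
}
```

With the full model, dna^7 needs 16383 free parameters; the chain above
covers the same 16384 words with the 15 free parameters of the dna^2
matrix, at the cost of ignoring any longer-range dependence.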
Anyway, thanks for finding the bug. Feel free to test the new code to
destruction and find more interesting behaviours.
Matthew