[Biojava-l] adding counts to Dists

Thomas Down td2@sanger.ac.uk
Thu, 13 Dec 2001 11:14:54 +0000


On Thu, Dec 13, 2001 at 10:33:52PM +1300, Mark Schreiber wrote:
> Hi -
> 
> When adding a large number of counts to a Distribution via a trainer, I
> have found it is much quicker to store the counts in an array (indexed by
> the AlphabetIndex for that alphabet), increment the counts as each symbol
> comes in, and then add the counts to the trainer at the end (followed by
> the .train() method).
> 
> I'm curious as to why this is. I assume it's because the trainer checks the
> validity of each symbol, although technically so does the AlphabetIndex by
> looking up the index for the symbol.
> 
> Not that this is a major issue; it might just be a way to speed up
> distribution training.

Do you know which implementation of Distribution you're
using?  SimpleDistribution uses a fairly sensible DistributionTrainer
object (which uses an Indexed and an array -- pretty much the
same as you are).  However, I notice that there's also something
called SimpleDistributionTrainer.  This is storing counts in
a Map<Symbol, Double>, and I suspect is likely to be /much/
less efficient -- especially as there's object churn every time
a new count is added.

If the distribution you're using is still using a SimpleDistributionTrainer,
I'd guess that could cause some fairly dreadful performance.
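
To make the difference concrete, here is a rough sketch of the two
counting strategies in plain Java. It does not use the BioJava classes:
CountSketch, countWithArray and countWithMap are hypothetical names, and
the int indices stand in for whatever an AlphabetIndex lookup would
return. Treat it as an illustration of the idea, not as the code in
either trainer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of BioJava.
final class CountSketch {

    // Array-backed counts: one primitive double per symbol index.
    // Incrementing a slot allocates nothing.
    static double[] countWithArray(int[] symbolIndices, int alphabetSize) {
        double[] counts = new double[alphabetSize];
        for (int idx : symbolIndices) {
            counts[idx] += 1.0;
        }
        return counts;
    }

    // Map-backed counts: every update hashes the key, reads the old
    // boxed value and stores a fresh Double, which is the kind of
    // object churn described above.
    static Map<Integer, Double> countWithMap(int[] symbolIndices) {
        Map<Integer, Double> counts = new HashMap<>();
        for (int idx : symbolIndices) {
            Double old = counts.get(idx);
            counts.put(idx, old == null ? 1.0 : old + 1.0);
        }
        return counts;
    }
}
```

Both methods produce the same totals; the array version just avoids the
per-count hashing and allocation, which is presumably why accumulating
in an array and handing the totals to the trainer once is so much faster.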

    Thomas.