[Biojava-dev] Distributions Gaps and Residual counts

mark.schreiber at group.novartis.com mark.schreiber at group.novartis.com
Mon Mar 1 22:05:25 EST 2004


Hi -

I have modified (hacked might be a better word) AbstractDistribution to 
take advantage of a weakness in the setWeight() method of Distribution so 
that Distributions can hold the weight of gaps. This follows up on a tip 
for which Thomas Down deserves the credit.

When you set the weights of Symbols using the setWeight() method there is 
no contract that says the weights have to add up to one. The weights can 
sum to less than one, which means some residual weight is not assigned to 
any Symbol. For this reason it is recommended that Distributions are 
trained using a DistributionTrainerContext, which avoids this behaviour. 
You wouldn't notice the residual weight unless you tried to sample from 
the Distribution, in which case you could get an exception if it tried to 
return a Symbol that doesn't exist.
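
To make the residual weight concrete, here is a minimal sketch in plain 
Java. It is not BioJava code: strings stand in for Symbols, and the class 
and method names (ResidualSampling, residual, sample) are made up for 
illustration. It shows weights that sum to less than one and a simple 
roulette-wheel sampler that has nothing to return when the random number 
lands in the unassigned part.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: plain strings stand in for BioJava Symbols.
final class ResidualSampling {

    // Weight left over after summing everything that was explicitly set.
    static double residual(Map<String, Double> weights) {
        double sum = 0.0;
        for (double w : weights.values()) {
            sum += w;
        }
        return 1.0 - sum;
    }

    // Roulette-wheel sampling: walk the symbols until the cumulative
    // weight passes p (a number in [0, 1)). Returns null when p lands in
    // the residual, i.e. no symbol owns that part of the wheel.
    static String sample(Map<String, Double> weights, double p) {
        double cumulative = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            cumulative += e.getValue();
            if (p < cumulative) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("a", 0.25);
        weights.put("c", 0.25);
        weights.put("g", 0.25); // "t" is never set, so 0.25 is unassigned
        System.out.println("residual     = " + residual(weights));
        System.out.println("sample(0.10) = " + sample(weights, 0.10));
        System.out.println("sample(0.90) = " + sample(weights, 0.90));
    }
}
```

In this toy version the "missing" symbol shows up as null. A real 
implementation would have to throw or return something else at that point.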

I have changed AbstractDistribution so that any residual weight is 
assigned to the gap Symbol. You can get the weight of the gap Symbol with 
the getWeight() method. You don't really need to set it, as you can just 
set the weights of the other Symbols and leave some room for the gap as 
residual weight. You cannot train gaps in, because training ultimately 
requires Symbols to be reduced to AtomicSymbols. It is not possible to 
make the gap atomic without changing half of the Symbol and Distribution 
APIs, which seemed pretty tiresome. I would vote for gaps being atomic in 
any redesign of biojava. I generally don't recommend playing around with 
residual weight, but I post it here to inform any keen developer of the 
possibility.
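
The idea of reading the gap weight back as "whatever is left over" can be 
sketched roughly as follows. This is a hypothetical stand-in written in 
plain Java, not the actual AbstractDistribution change: the class name 
GapAwareDistribution, the use of "-" as the gap marker, and the decision 
to reject an explicit gap weight are all assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not BioJava's AbstractDistribution.
final class GapAwareDistribution {
    static final String GAP = "-"; // assumed gap marker for this sketch

    private final Map<String, Double> weights = new HashMap<>();

    // Store the weight of an ordinary symbol. Rejecting the gap here is a
    // choice made for this sketch only: its weight is implied, never stored.
    void setWeight(String symbol, double weight) {
        if (GAP.equals(symbol)) {
            throw new IllegalArgumentException("gap weight is implied by the residual");
        }
        if (weight < 0.0 || weight > 1.0) {
            throw new IllegalArgumentException("weight must be in [0, 1]");
        }
        weights.put(symbol, weight);
    }

    // Ordinary symbols return what was set (0.0 if never set); the gap
    // returns whatever the other symbols left unassigned.
    double getWeight(String symbol) {
        if (GAP.equals(symbol)) {
            double sum = 0.0;
            for (double w : weights.values()) {
                sum += w;
            }
            return Math.max(0.0, 1.0 - sum);
        }
        return weights.getOrDefault(symbol, 0.0);
    }

    public static void main(String[] args) {
        GapAwareDistribution d = new GapAwareDistribution();
        d.setWeight("a", 0.3);
        d.setWeight("c", 0.3);
        d.setWeight("g", 0.2);
        d.setWeight("t", 0.1); // leaves roughly 0.1 for the gap
        System.out.println("gap weight = " + d.getWeight(GAP));
    }
}
```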

Anyhow, the reason for this hack is that it allows the DistributionTools 
method distOverAlignment() to keep track of the frequency of gaps at any 
position in an Alignment. This is probably the only recommended way of 
producing a Distribution with gap weight. The change probably makes 
setWeight() slightly safer to use, and if you see gaps in your sample()s 
then you have probably messed up setting the weights in your Distribution.
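
For anyone wondering what "the frequency of gaps at a position" amounts 
to, here is a rough illustration in plain Java. It does not call 
distOverAlignment() and it counts gaps explicitly instead of deriving them 
from residual weight. The class name ColumnFrequencies, the use of strings 
as alignment rows, and "-" as the gap character are assumptions for this 
sketch, which also assumes all rows have the same length.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: alignment rows are plain strings, '-' marks a gap.
final class ColumnFrequencies {

    // Relative frequency of every character (gaps included) in one column.
    static Map<Character, Double> column(String[] rows, int col) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            counts.merge(row.charAt(col), 1, Integer::sum);
        }
        Map<Character, Double> freq = new TreeMap<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) rows.length);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] alignment = { "ac-t", "a-gt", "acgt", "a--t" };
        for (int col = 0; col < alignment[0].length(); col++) {
            System.out.println("column " + col + ": " + column(alignment, col));
        }
    }
}
```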

Let me know if this change causes any unexpected strangeness.

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
1 Science Park Road
#04-14 The Capricorn, Science Park II
Singapore 117528

phone +65 6722 2973
fax  +65 6722 2910


