[Biojava-dev] Current Alphabet design and an unintended consequence for training

Sat Jan 3 09:32:53 EST 2004

The current Alphabet system uses the BasisSymbol to represent both ambiguity 
symbols and symbols from cross product alphabets.  This causes unintended 
consequences for training algorithms.

The DistributionTrainerContext has an addCount(Distribution dist, Symbol 
sym, double times) method.  When using cross product alphabets, it works 
flawlessly when it encounters AtomicSymbols from the cross-product alphabet 
(and these are also BasisSymbols).  In the design of the 
DistributionTrainer interface, the equivalent method addCount(Distribution 
dist, AtomicSymbol sym, double times) accepts only an AtomicSymbol, which 
is reasonable.

However, when training two-head distributions, it is not implausible for the 
DistributionTrainerContext.addCount() to receive Symbols that are not 
AtomicSymbols.  The most common by far would be symbols emitted by gap 
states, of the form (gap cytosine), for example.  The current implementation 
of the addCount method assumes that non-atomic symbols are ambiguity symbols 
and attempts to deal with them in that manner.  It evidently fails in the 
above case; indeed, it fails silently.  This problem currently prevents the 
training of PairDistributions in which one component Distribution is a 
GapDistribution.

There appears to be no easy way of fixing this problem at the level of   
DistributionTrainerContext.  It is formally possible that the BasisSymbol 
received by addCount is truly an ambiguity symbol containing a number of 
symbols from the cross-product alphabet of the two-head HMM model.  It is 
also possible that the BasisSymbol represents a single symbol comprising 
ambiguity symbol(s) from one or both alphabets that form the cross product 
alphabet.  The two are evidently not equivalent and have to be dealt with 
differently, and working out which case applies is potentially 
computationally costly for an operation that is repeated very many times 
during training.
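
One way to picture the distinction, as a rough sketch only: a symbol built 
from per-alphabet ambiguities always expands to a full cross product of its 
per-position choices, whereas a true ambiguity symbol may match any subset 
of the cross-product alphabet.  The check below works on plain string 
tuples, not on BioJava types, and the class and method names are 
hypothetical.  It also shows why the test is not free: every matched tuple 
has to be visited.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of BioJava. Tuples are lists of symbol names.
final class CrossProductCheck {
    private CrossProductCheck() {}

    /**
     * Returns true if the given set of tuples equals the full cross product
     * of its per-position projections. Any set of tuples is a subset of that
     * product, so the two are equal exactly when their sizes agree.
     */
    static boolean isFullCrossProduct(Set<List<String>> tuples) {
        if (tuples.isEmpty()) {
            return false;
        }
        int arity = tuples.iterator().next().size();
        List<Set<String>> projections = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            projections.add(new HashSet<String>());
        }
        // Visit every tuple once and collect the symbols seen at each position.
        for (List<String> tuple : tuples) {
            if (tuple.size() != arity) {
                throw new IllegalArgumentException("tuples of mixed length");
            }
            for (int i = 0; i < arity; i++) {
                projections.get(i).add(tuple.get(i));
            }
        }
        long product = 1;
        for (Set<String> projection : projections) {
            product *= projection.size();
        }
        return product == tuples.size();
    }
}
```

Under this sketch, {(gap cytosine), (gap guanine)} factors into (gap 
[cytosine|guanine]), while {(adenine cytosine), (guanine thymine)} does not 
factor and could only be a genuine ambiguity over the cross-product 
alphabet.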

Even if this ambiguity could be resolved at the level of 
DistributionTrainerContext, and the symbol were known to be of the form (gap 
<something else>), that symbol could not be passed on to a 
DistributionTrainer capable of dealing with it: the addCount method in that 
interface accepts only atomic symbols, which something like (gap guanine) is 
not.

Interim solutions could be:-
1) change DistributionTrainer.addCount() to accept non-atomic symbols.  
DistributionTrainerContext's addCount method will leave it to the 
distribution trainers to sort out what to do with non-atomic symbols 
themselves.  
OR
2) add an ExtendedDistributionTrainer interface with one method addCount that 
can accept non-atomic symbols.  DistributionTrainerContext's addCount 
method will check whether the symbol it receives is atomic.  If it is, it 
will use the standard DistributionTrainer.addCount().  If not, it will 
determine if the trainer for that distribution implements the 
ExtendedDistributionTrainer interface and if so, call that interface's 
addCount method to leave it to deal with the symbol.  If not, it will 
assume that the symbol is an ambiguity symbol and deal with it in the 
manner it does now.
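
The dispatch described in (2) could look roughly like the sketch below.  
The interfaces here are deliberately simplified stand-ins and not the real 
BioJava ones: the Distribution argument is omitted, 
ExtendedDistributionTrainer is the hypothetical interface proposed above, 
and TrainerDispatch is a made-up name standing in for the relevant part of 
DistributionTrainerContext.  The method returns a label for the route taken 
purely so the behaviour can be inspected.

```java
// Simplified stand-ins for illustration; NOT the real BioJava interfaces.
interface Symbol {
    String getName();
}

interface AtomicSymbol extends Symbol {
}

interface DistributionTrainer {
    // The real method also takes a context/distribution argument; omitted here.
    void addCount(AtomicSymbol sym, double times);
}

// Hypothetical interface from option (2): also accepts non-atomic symbols.
interface ExtendedDistributionTrainer extends DistributionTrainer {
    void addCount(Symbol sym, double times);
}

// Hypothetical stand-in for the dispatch inside the trainer context.
final class TrainerDispatch {
    private TrainerDispatch() {}

    static String addCount(DistributionTrainer trainer, Symbol sym, double times) {
        if (sym instanceof AtomicSymbol) {
            // Atomic symbols go through the standard interface, as now.
            trainer.addCount((AtomicSymbol) sym, times);
            return "atomic";
        }
        if (trainer instanceof ExtendedDistributionTrainer) {
            // The trainer has opted in to handling non-atomic symbols itself.
            ((ExtendedDistributionTrainer) trainer).addCount(sym, times);
            return "extended";
        }
        // Otherwise fall back to treating the symbol as an ambiguity symbol;
        // the existing ambiguity handling would go here.
        return "ambiguity";
    }
}
```

Existing trainers would be untouched by this, since only trainers that 
implement the extra interface ever see a non-atomic symbol.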

(2) is probably less disruptive to existing code and interfaces.  It may be 
that the DistributionTrainer (DT) is a better place to deal with non-atomic 
symbols than the DistributionTrainerContext (DTC), since the DT knows more 
about the internals of its Distribution and what it can/should handle, while 
the DTC must of necessity implement a one-size-fits-all approach.

For Biojava 2, it may be worthwhile to revisit the Alphabet design and 
explicitly distinguish ambiguity symbols from BasisSymbols, on the basis that 
the former is a Set of Symbols while the latter is a List of Symbols.
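
As a very rough sketch of what such a split might look like, with every 
type name below invented for illustration and none of it taken from the 
existing or planned BioJava API: an ambiguity is modelled as an unordered 
set of alternatives and a cross-product symbol as an ordered list with one 
component per alphabet, so the two can be told apart by type alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// All names here are hypothetical; this is not BioJava's type hierarchy.
interface Sym {
}

// A single indivisible symbol such as cytosine or gap.
final class Atom implements Sym {
    final String name;

    Atom(String name) {
        this.name = name;
    }
}

// Ambiguity: an unordered Set of alternative symbols from one alphabet.
final class AmbiguitySym implements Sym {
    final Set<Sym> alternatives;

    AmbiguitySym(Set<? extends Sym> alternatives) {
        this.alternatives =
            Collections.unmodifiableSet(new LinkedHashSet<Sym>(alternatives));
    }
}

// Cross-product symbol: an ordered List with one component per alphabet.
final class TupleSym implements Sym {
    final List<Sym> components;

    TupleSym(List<? extends Sym> components) {
        this.components =
            Collections.unmodifiableList(new ArrayList<Sym>(components));
    }
}

final class SymKinds {
    private SymKinds() {}

    // With separate types, no guessing is needed to classify a symbol.
    static String describe(Sym s) {
        if (s instanceof TupleSym) {
            return "tuple of " + ((TupleSym) s).components.size();
        }
        if (s instanceof AmbiguitySym) {
            return "ambiguity over " + ((AmbiguitySym) s).alternatives.size();
        }
        return "atomic";
    }
}
```

In this arrangement (gap cytosine) is a TupleSym of two Atoms, and a tuple 
component may itself be an AmbiguitySym, so both cases described earlier 
remain expressible without being confused with each other.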

Regards,
David Huen