[Biojava-dev] Current Alphabet design and an unintended consequence
for training
David Huen
david.huen at ntlworld.com
Sat Jan 3 09:32:53 EST 2004
The current Alphabet system uses the BasisSymbol to represent both ambiguity
symbols and symbols from cross-product alphabets. This causes unintended
consequences for training algorithms.
The DistributionTrainerContext has an addCount(Distribution dist, Symbol
sym, double times) method. When using cross product alphabets, it works
flawlessly when it encounters AtomicSymbols from the cross-product alphabet
(and these are also BasisSymbols). In the design of the
DistributionTrainer interface, the equivalent method addCount(Distribution
dist, AtomicSymbol sym, double times) accepts only an AtomicSymbol, which
is reasonable.
However, when training two-head distributions, it is not implausible for
DistributionTrainerContext.addCount() to receive Symbols that are not
AtomicSymbols. The most common by far would be symbols emitted by gap
states, of the form e.g. (gap cytosine). The current implementation of the
addCount method assumes that non-atomic symbols are ambiguity symbols and
attempts to deal with them in that manner. It fails in the above case and,
worse, it fails silently. This problem currently prevents the
training of PairDistributions in which one component Distribution is a
GapDistribution.
There appears to be no easy way of fixing this problem at the level of
DistributionTrainerContext. It is formally possible that the BasisSymbol
received by addCount is truly an ambiguity symbol containing a number of
symbols from the cross-product alphabet of the two-head HMM model. It is
also possible that the BasisSymbol represents a single symbol comprising
ambiguity symbol(s) from one or both alphabets that form the cross product
alphabet. The two are evidently not equivalent and have to be dealt with
differently. Resolving which case applies is potentially computationally
costly for an operation that is repeated very many times during training.
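
To make the difference between the two readings concrete, here is a rough
sketch. It does not use the BioJava classes at all: symbols are plain
strings, and the class name CrossProductExpansion and its helpers are
invented for illustration. It expands a cross-product symbol whose
positions may each be ambiguous into the atomic tuples it matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: symbols are plain strings, not BioJava Symbol objects.
final class CrossProductExpansion {

    // Expand a tuple whose positions may each be ambiguous into every
    // atomic tuple it matches (the Cartesian product of the positions).
    static Set<List<String>> expand(List<Set<String>> positions) {
        Set<List<String>> result = new LinkedHashSet<>();
        result.add(new ArrayList<>());          // start from the empty prefix
        for (Set<String> choices : positions) {
            Set<List<String>> next = new LinkedHashSet<>();
            for (List<String> prefix : result) {
                for (String choice : choices) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(choice);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    // Convenience helper for building one position's set of alternatives.
    static Set<String> setOf(String... items) {
        return new LinkedHashSet<>(Arrays.asList(items));
    }
}
```

With this sketch, the single symbol ([a g] [c t]) expands to four atomic
pairs, whereas an ambiguity symbol over the pairs (a c) and (g t) matches
only those two. The expansion grows multiplicatively with the number of
ambiguous positions, which is the kind of cost one would rather not pay on
every addCount call.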
Even if this ambiguity could be resolved at the level of
DistributionTrainerContext, and you knew the symbol to be of the form (gap
<something else>), that symbol still could not be passed to a
DistributionTrainer capable of dealing with it: the addCount method in that
interface accepts only atomic symbols, which something like (gap guanine) is
not.
Interim solutions could be:-
1) change the DistributionTrainer.addCount() to accept non-atomic symbols.
DistributionTrainerContext's addCount method will leave it to the
distribution trainers to sort out what to do with non-atomic symbols
themselves.
OR
2) add an ExtendedDistributionTrainer interface with one method addCount that
can accept non-atomic symbols. DistributionTrainerContext's addCount
method will check whether the symbol it receives is atomic. If it is, it
will use the standard DistributionTrainer.addCount(). If not, it will
determine if the trainer for that distribution implements the
ExtendedDistributionTrainer interface and if so, call that interface's
addCount method to leave it to deal with the symbol. If not, it will
assume that the symbol is an ambiguity symbol and deal with it in the
manner it does now.
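
The dispatch described in (2) might look roughly like the following. This
is only a sketch: the interfaces below are cut-down stand-ins that borrow
the names used above, not the real BioJava declarations (no exceptions, no
alphabets), and TrainerContextSketch, NameCountingTrainer and the returned
path labels are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the types named in the post; NOT the real
// BioJava interfaces.
interface Symbol { String getName(); }
interface AtomicSymbol extends Symbol { }
interface Distribution { }

interface DistributionTrainer {
    void addCount(Distribution dist, AtomicSymbol sym, double times);
}

// Proposed opt-in interface: one addCount that accepts any Symbol.
interface ExtendedDistributionTrainer {
    void addCount(Distribution dist, Symbol sym, double times);
}

// Hypothetical context showing only the three-way dispatch.
final class TrainerContextSketch {
    private final Map<Distribution, DistributionTrainer> trainers = new HashMap<>();

    void register(Distribution dist, DistributionTrainer trainer) {
        trainers.put(dist, trainer);
    }

    // Returns a label for the path taken, purely for illustration.
    String addCount(Distribution dist, Symbol sym, double times) {
        DistributionTrainer trainer = trainers.get(dist);
        if (trainer == null) {
            return "no trainer";
        }
        if (sym instanceof AtomicSymbol) {
            // Atomic symbols go through the standard interface.
            trainer.addCount(dist, (AtomicSymbol) sym, times);
            return "atomic";
        }
        if (trainer instanceof ExtendedDistributionTrainer) {
            // The trainer has opted in to handling non-atomic symbols itself.
            ((ExtendedDistributionTrainer) trainer).addCount(dist, sym, times);
            return "extended";
        }
        // Otherwise fall back to treating the symbol as an ambiguity symbol
        // (the existing handling would go here; it is omitted in this sketch).
        return "ambiguity";
    }
}

// Example trainer that opts in and simply tallies counts by symbol name.
final class NameCountingTrainer implements DistributionTrainer, ExtendedDistributionTrainer {
    final Map<String, Double> counts = new HashMap<>();

    public void addCount(Distribution dist, AtomicSymbol sym, double times) {
        counts.merge(sym.getName(), times, Double::sum);
    }

    public void addCount(Distribution dist, Symbol sym, double times) {
        counts.merge(sym.getName(), times, Double::sum);
    }
}
```

Existing trainers that implement only DistributionTrainer keep their
current behaviour under this arrangement; only trainers that choose to
implement the extra interface ever see a non-atomic symbol.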
(2) is probably less disruptive to existing code and interfaces. It may be
that the DistributionTrainer is a better place to deal with non-atomic
symbols than DistributionTrainerContext, since the DT knows more about
the internals of its Distribution and what it can/should handle, while the
DTC must of necessity implement a one-size-fits-all approach.
At Biojava 2, it may be worthwhile to revisit the Alphabet design and
explicitly distinguish ambiguity symbols from BasisSymbols, on the basis that
the former is a Set of Symbols while the latter is a List of Symbols.
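
As a very rough sketch of what that separation might look like (all type
names here, Sym, Atom, AmbiguitySym, TupleSym and SymKinds, are made up for
illustration and are not a proposal for actual class names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical types, not BioJava classes.
interface Sym { }

final class Atom implements Sym {
    final String name;
    Atom(String name) { this.name = name; }
}

// An ambiguity symbol: an unordered Set of alternatives.
final class AmbiguitySym implements Sym {
    final Set<Sym> alternatives;
    AmbiguitySym(Sym... alternatives) {
        this.alternatives = Collections.unmodifiableSet(
                new LinkedHashSet<>(Arrays.asList(alternatives)));
    }
}

// A cross-product symbol: an ordered List with one component per alphabet.
final class TupleSym implements Sym {
    final List<Sym> components;
    TupleSym(Sym... components) {
        this.components = Collections.unmodifiableList(
                new ArrayList<>(Arrays.asList(components)));
    }
}

final class SymKinds {
    // The kind of symbol is read off its type; no expansion is needed.
    static String describe(Sym sym) {
        if (sym instanceof TupleSym) {
            return "tuple of " + ((TupleSym) sym).components.size();
        }
        if (sym instanceof AmbiguitySym) {
            return "ambiguity over " + ((AmbiguitySym) sym).alternatives.size();
        }
        return "atom";
    }
}
```

Under such a scheme (gap cytosine) would be a TupleSym of two components,
and a training context could tell it apart from an ambiguity symbol with a
single type check.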
Regards,
David Huen