[Biojava-dev] Current Alphabet design and an
unintended consequence for training
Matthew Pocock
matthew_pocock at yahoo.co.uk
Mon Jan 12 12:55:17 EST 2004
/me puts on alphabet guru hat
I would tend to agree with Mark. It is unfortunate that we don't have a
'gap' interface. Occasionally relational logic and object models really
don't mix. Pants.
/me takes off alphabet guru hat
mark schreiber wrote:
>David,
>
>I think you may be able to use solution 1. As Symbol is a superclass
>(interface) of AtomicSymbol you won't be breaking any API.
>
>- Mark
>
>
>
>
>>-----Original Message-----
>>From: biojava-dev-bounces at portal.open-bio.org
>>[mailto:biojava-dev-bounces at portal.open-bio.org] On Behalf Of
>>David Huen
>>Sent: Saturday, 3 January 2004 10:33 p.m.
>>To: biojava-dev at biojava.org
>>Subject: [Biojava-dev] Current Alphabet design and an
>>unintended consequence for training
>>
>>The current Alphabet system uses the BasisSymbol to represent
>>both ambiguity symbols and symbols from cross product
>>alphabets. This causes unintended consequences for training
>>algorithms.
>>
>>The DistributionTrainerContext has an addCount(Distribution
>>dist, Symbol sym, double times) method. When using cross
>>product alphabets, it works flawlessly when it encounters
>>AtomicSymbols from the cross-product alphabet (and these are
>>also BasisSymbols). In the design of the DistributionTrainer
>>interface, the equivalent method addCount(Distribution dist,
>>AtomicSymbol sym, double times) accepts only an AtomicSymbol,
>>which is reasonable.
>>
>>However, when training two-head distributions, it is not
>>implausible for the
>>DistributionTrainerContext.addCount() to receive Symbols that
>>are not AtomicSymbols. The most common by far would be
>>symbols emitted by gap states, of the form e.g. (gap cytosine).
>>The current implementation of the addCount method assumes
>>that non-atomic symbols are ambiguity symbols and attempts to
>>deal with them in that manner. Evidently it fails in the
>>above case; indeed, it fails silently. This problem
>>currently prevents the training of PairDistributions in which
>>one component Distribution is a GapDistribution.
>>
>>There appears to be no easy way of fixing this problem at the
>>level of
>>DistributionTrainerContext. It is formally possible that the
>>BasisSymbol received by addCount is truly an ambiguity symbol
>>containing a number of symbols from the cross-product
>>alphabet of the two-head HMM model. It is also possible that
>>the BasisSymbol represents a single symbol comprising
>>ambiguity symbol(s) from one or both alphabets that form the
>>cross product alphabet. The two are evidently not equivalent
>>and have to be dealt with differently. Resolving which case
>>applies is potentially computationally costly for an operation
>>that is repeated very many times during training.
>>
>>Even if this ambiguity could be resolved at the level of
>>DistributionTrainerContext and you knew the symbol to be one
>>of type (gap <something else>), that symbol cannot be passed
>>to a DistributionTrainer that may be capable of dealing with
>>it, as the addCount method in that interface accepts only
>>atomic symbols, which something like (gap guanine) is not.
>>
>>Interim solutions could be:-
>>1) change the DistributionTrainer.addCount() to accept
>>non-atomic symbols.
>>DistributionTrainerContext's addCount method will leave it to
>>the distribution trainers to sort out what to do with
>>non-atomic symbols themselves.
>>OR
>>2) add an ExtendedDistributionTrainer interface with one
>>method addCount that can accept non-atomic symbols.
>>DistributionTrainerContext's addCount method will check
>>whether the symbol it receives is atomic. If it is, it will
>>use the standard DistributionTrainer.addCount(). If not, it
>>will determine if the trainer for that distribution
>>implements the ExtendedDistributionTrainer interface and if
>>so, call that interface's addCount method to leave it to deal
>>with the symbol. If not, it will assume that the symbol is
>>an ambiguity symbol and deal with it in the manner it does now.
>>
>>(2) is probably less disruptive to existing code and
>>interfaces. It may be that the DistributionTrainer is a
>>better place to deal with non-atomic symbols than
>>DistributionTrainerContext, since the DT knows more about
>>the internals of that Distribution and what it can/should
>>handle, while the DTC must of necessity implement a
>>one-size-fits-all approach.
>>
>>At Biojava 2, it may be worthwhile to revisit the Alphabet
>>design and explicitly distinguish ambiguity symbols and
>>BasisSymbols on the level that the former is a Set of
>>symbols, while the latter is a List of Symbols.
>>
>>Regards,
>>David Huen
>>
>>
>>
>>
>>
>>_______________________________________________
>>biojava-dev mailing list
>>biojava-dev at biojava.org
>>http://biojava.org/mailman/listinfo/biojava-dev
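For reference, the dispatch described under interim solution (2) in the quoted
message could look roughly like the sketch below. It is only an illustration:
the types are simplified stand-ins written for this example, not the real
BioJava interfaces; the Distribution argument of addCount is left out for
brevity; the dispatch method and the three route names it returns are
hypothetical; and the existing ambiguity-symbol handling is represented by a
placeholder rather than modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; these are NOT the real BioJava types.
class TrainerDispatchSketch {
    interface Symbol { String getName(); }
    interface AtomicSymbol extends Symbol { }

    /** An atomic symbol, e.g. cytosine. */
    static final class Atomic implements AtomicSymbol {
        private final String name;
        Atomic(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /** A non-atomic symbol, e.g. (gap cytosine). */
    static final class NonAtomic implements Symbol {
        private final String name;
        NonAtomic(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /** Mirrors the existing contract: only atomic symbols are accepted. */
    interface DistributionTrainer {
        void addCount(AtomicSymbol sym, double times);
    }

    /** Hypothetical interface from option (2): accepts any symbol. */
    interface ExtendedDistributionTrainer {
        void addCount(Symbol sym, double times);
    }

    /** A trainer that only understands atomic symbols. */
    static final class BasicTrainer implements DistributionTrainer {
        final Map<String, Double> counts = new HashMap<>();
        public void addCount(AtomicSymbol sym, double times) {
            counts.merge(sym.getName(), times, Double::sum);
        }
    }

    /** A trainer that also decides for itself what to do with non-atomic symbols. */
    static final class CountingTrainer
            implements DistributionTrainer, ExtendedDistributionTrainer {
        final Map<String, Double> counts = new HashMap<>();
        public void addCount(AtomicSymbol sym, double times) {
            counts.merge(sym.getName(), times, Double::sum);
        }
        public void addCount(Symbol sym, double times) {
            counts.merge(sym.getName(), times, Double::sum);
        }
    }

    /** The context-level dispatch; returns the route taken so it can be inspected. */
    static String dispatch(DistributionTrainer trainer, Symbol sym, double times) {
        if (sym instanceof AtomicSymbol) {
            trainer.addCount((AtomicSymbol) sym, times);
            return "atomic";
        }
        if (trainer instanceof ExtendedDistributionTrainer) {
            ((ExtendedDistributionTrainer) trainer).addCount(sym, times);
            return "extended";
        }
        // Placeholder for the current behaviour: treat the symbol as an
        // ambiguity symbol. That handling is not modelled in this sketch.
        return "ambiguity-fallback";
    }

    public static void main(String[] args) {
        Symbol c = new Atomic("cytosine");
        Symbol gapC = new NonAtomic("(gap cytosine)");
        CountingTrainer extended = new CountingTrainer();
        System.out.println(dispatch(extended, c, 1.0));
        System.out.println(dispatch(extended, gapC, 1.0));
        System.out.println(dispatch(new BasicTrainer(), gapC, 1.0));
    }
}
```

The instanceof check on the trainer is what keeps existing DistributionTrainer
implementations untouched: only trainers that opt in to the extended interface
ever see a non-atomic symbol.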