[Biojava-dev] Current Alphabet design and an unintended consequence for training

Mon Jan 12 12:55:17 EST 2004

/me puts on alphabet guru hat

I would tend to agree with Mark. It is unfortunate that we don't have a 
'gap' interface. Occasionally relational logic and object models really 
don't mix. Pants.

/me takes off alphabet guru hat

mark schreiber wrote:

>David,
>
>I think you may be able to use solution 1. As Symbol is a superclass
>(interface) of AtomicSymbol you won't be breaking any API.
>
>- Mark
>
>
>  
>
>>-----Original Message-----
>>From: biojava-dev-bounces at portal.open-bio.org 
>>[mailto:biojava-dev-bounces at portal.open-bio.org] On Behalf Of 
>>David Huen
>>Sent: Saturday, 3 January 2004 10:33 p.m.
>>To: biojava-dev at biojava.org
>>Subject: [Biojava-dev] Current Alphabet design and an 
>>unintended consequence for training
>>
>>The current Alphabet system uses the BasisSymbol to represent 
>>both ambiguity symbols and symbols from cross product 
>>alphabets.  This causes unintended consequences for training 
>>algorithms.
>>
>>The DistributionTrainerContext has an addCount(Distribution 
>>dist, Symbol sym, double times) method.  When using cross 
>>product alphabets, it works flawlessly when it encounters 
>>AtomicSymbols from the cross-product alphabet (and these are 
>>also BasisSymbols).  In the design of the DistributionTrainer 
>>interface, the equivalent method addCount(Distribution dist, 
>>AtomicSymbol sym, double times) accepts only an AtomicSymbol, 
>>which is reasonable.
>>
>>However, when training two-head distributions, it is not 
>>implausible for the
>>DistributionTrainerContext.addCount() to receive Symbols that 
>>are not AtomicSymbols.  The most common by far would be 
>>symbols emitted by gap states, of the form e.g. (gap cytosine).  
>>The current implementation of the addCount method assumes 
>>that non-atomic symbols are ambiguity symbols and attempts to 
>>deal with them in that manner.  Evidently it fails in the 
>>above case; indeed, it fails silently.  This problem 
>>currently prevents the training of PairDistributions in which 
>>one component Distribution is a GapDistribution.
>>
>>There appears to be no easy way of fixing this problem at the 
>>level of   
>>DistributionTrainerContext.  It is formally possible that the 
>>BasisSymbol received by addCount is truly an ambiguity symbol 
>>containing a number of symbols from the cross-product 
>>alphabet of the two-head HMM model.  It is also possible that 
>>the BasisSymbol represents a single symbol comprising 
>>ambiguity symbol(s) from one or both alphabets that form the 
>>cross product alphabet.  The two are evidently not equivalent 
>>and have to be dealt with differently.  And resolving which 
>>it is is potentially computationally costly for an operation 
>>that is repeated very many times during training.
>>
>>Even if this ambiguity could be resolved at the level of 
>>DistributionTrainerContext and you knew the symbol to be one 
>>of type (gap <something else>), that symbol cannot be passed 
>>to a DistributionTrainer that may be capable of dealing with 
>>it, as the addCount method in that interface accepts only 
>>atomic symbols, which something like (gap guanine) is not.
>>
>>Interim solutions could be:-
>>1) change the DistributionTrainer.addCount()  to accept 
>>non-atomic symbols.  
>>DistributionTrainerContext's addCount method will leave it to 
>>the distribution trainers to sort out what to do with 
>>non-atomic symbols themselves.  
>>OR
>>2) add an ExtendedDistributionTrainer interface with one 
>>method addCount that can accept non-atomic symbols.  
>>DistributionTrainerContext's addCount method will check 
>>whether the symbol it receives is atomic.  If it is, it will 
>>use the standard DistributionTrainer.addCount().  If not, it 
>>will determine if the trainer for that distribution 
>>implements the ExtendedDistributionTrainer interface and if 
>>so, call that interface's addCount method to leave it to deal 
>>with the symbol.  If not, it will assume that the symbol is 
>>an ambiguity symbol and deal with it in the manner it does now.
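The dispatch described in option (2) can be sketched roughly as follows. This is only an illustration under stated assumptions: every type here is a simplified, hypothetical stand-in written for this example, not the real BioJava interface (the real addCount methods take further arguments), and the Route enum and GapAwareTrainer class are invented purely to make the chosen branch visible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the types named in the message.
// They are NOT the real org.biojava interfaces or signatures.
interface Symbol { String getName(); }
interface AtomicSymbol extends Symbol { }

class NamedSymbol implements Symbol {
    private final String name;
    NamedSymbol(String name) { this.name = name; }
    public String getName() { return name; }
}

class NamedAtomicSymbol extends NamedSymbol implements AtomicSymbol {
    NamedAtomicSymbol(String name) { super(name); }
}

// A non-atomic symbol such as (gap cytosine): one component per head.
class PairSymbol implements Symbol {
    final List<Symbol> components;
    PairSymbol(Symbol first, Symbol second) { components = Arrays.asList(first, second); }
    public String getName() {
        return "(" + components.get(0).getName() + " " + components.get(1).getName() + ")";
    }
}

interface DistributionTrainer {
    void addCount(AtomicSymbol sym, double times);
}

// The extra interface proposed in option (2): a single method.
interface ExtendedDistributionTrainer {
    void addCount(Symbol sym, double times);
}

// Assumption: returned only so the example can show which branch ran.
enum Route { ATOMIC, EXTENDED, AMBIGUITY_FALLBACK }

class SketchTrainerContext {
    Route addCount(DistributionTrainer trainer, Symbol sym, double times) {
        if (sym instanceof AtomicSymbol) {
            // Atomic symbols take the standard path.
            trainer.addCount((AtomicSymbol) sym, times);
            return Route.ATOMIC;
        }
        if (trainer instanceof ExtendedDistributionTrainer) {
            // The trainer has opted in to handling non-atomic symbols itself.
            ((ExtendedDistributionTrainer) trainer).addCount(sym, times);
            return Route.EXTENDED;
        }
        // Placeholder: the existing ambiguity-symbol handling would go here.
        return Route.AMBIGUITY_FALLBACK;
    }
}

// Hypothetical trainer that simply tallies counts by symbol name.
class GapAwareTrainer implements DistributionTrainer, ExtendedDistributionTrainer {
    final Map<String, Double> counts = new HashMap<>();
    public void addCount(AtomicSymbol sym, double times) { addCount((Symbol) sym, times); }
    public void addCount(Symbol sym, double times) {
        counts.merge(sym.getName(), times, Double::sum);
    }
}
```

A trainer that does not implement the extra interface keeps its current behaviour, which is why this option should leave existing code untouched. The cost per call is one or two instanceof checks.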
>>
>>(2) is probably less disruptive to existing code and 
>>interfaces.  It may be that the DistributionTrainer is a 
>>better place to deal with non-atomic symbols than 
>>DistributionTrainerContext since the DT knows more about 
>>the internals of that Distribution and what it can/should 
>>handle while the DTC must of necessity implement a 
>>one-size-fits-all approach.
>>
>>At Biojava 2, it may be worthwhile to revisit the Alphabet 
>>design and explicitly distinguish ambiguity symbols from 
>>BasisSymbols, in that the former is a Set of symbols while 
>>the latter is a List of symbols.
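One way to picture the Set versus List distinction is the minimal sketch below. The types Sym, Atom, AmbiguitySym and TupleSym are hypothetical names made up for this example and do not exist in BioJava. The point is only that, with separate types, the two readings discussed earlier (an ambiguity over cross-product symbols versus a cross-product symbol containing an ambiguity) become distinguishable by a cheap type check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration only; not the BioJava API.
interface Sym { }

final class Atom implements Sym {
    final String name;
    Atom(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof Atom && ((Atom) o).name.equals(name);
    }
    @Override public int hashCode() { return name.hashCode(); }
}

// Ambiguity: an unordered set of alternatives, so order is irrelevant.
final class AmbiguitySym implements Sym {
    final Set<Sym> alternatives;
    AmbiguitySym(Sym... alts) { alternatives = new HashSet<>(Arrays.asList(alts)); }
    @Override public boolean equals(Object o) {
        return o instanceof AmbiguitySym
            && ((AmbiguitySym) o).alternatives.equals(alternatives);
    }
    @Override public int hashCode() { return alternatives.hashCode(); }
}

// Cross product: an ordered list with one component per alphabet.
final class TupleSym implements Sym {
    final List<Sym> components;
    TupleSym(Sym... parts) { components = Arrays.asList(parts); }
    @Override public boolean equals(Object o) {
        return o instanceof TupleSym
            && ((TupleSym) o).components.equals(components);
    }
    @Override public int hashCode() { return components.hashCode(); }
}
```

For example, new AmbiguitySym(new TupleSym(a, c), new TupleSym(g, c)) and new TupleSym(new AmbiguitySym(a, g), c) are different objects of different types, so a trainer could tell them apart with instanceof instead of inspecting their contents.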
>>
>>Regards,
>>David Huen
>>
>>
>>
>>
>>
>>_______________________________________________
>>biojava-dev mailing list
>>biojava-dev at biojava.org
>>http://biojava.org/mailman/listinfo/biojava-dev