[Biojava-l] Mixing order of models

Matthew Pocock mrp@sanger.ac.uk
Thu, 11 Oct 2001 15:14:51 +0100


Thomas Down wrote:

> On Thu, Oct 11, 2001 at 01:01:20PM +1300, Mark Schreiber wrote:
> 
>>Hi -
>>
>>A while ago I asked a question about mixing the order of Markov Models.
>>The reason I want to do this is that some regions contain enough
>>sequence to make 2nd order models, while other regions can easily make
>>5th order models. I seem to remember someone suggesting the use of
>>redundant hexamers to mimic trimers (or something); how would this work?
>>
> 
> Well, it's certainly possible to mix orders of models, by using
> distributions which are high-order, but where some (or all)
> of the conditioning symbols are ignored.  I guess that's equivalent
> to what you say about redundant hexamers.
> 
> Training such a distribution might be a little more interesting,
> but it should be possible to write a DistributionTrainer which
> picks an appropriate order of distribution, based on the amount
> of supporting evidence.  In a nice, Bayesian world, this should
> just be a question of using a sensible prior.  (Matthew: am I
> missing something here?)
> 


You are correct. In a perfect world, you would choose a prior over order 
(something like an exponential decay, for example), and estimate what 
order could be supported by each region of the model. For a single state 
you could mix 1st order through to n'th order probabilities, using an 
estimate of a mixing function (mixed with the prior to make it sensible) 
and estimates of the individual distributions guessed from the training 
data. You should be able to convince it to collapse down to a single 
order (or even take the maximum likelihood estimate of order for a 
state). Yuck. It wouldn't require the DP package to be changed in any 
way - you'd just have to write some Distribution instances, perhaps 
along the lines of the sketch below.
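
As a hedged sketch (it won't compile as-is, any more than the code at 
the end of this mail), a single-state order-mixing distribution - call 
it MixedOrderDistribution for the sake of argument - might look 
something like this; the component distributions and mixing weights are 
placeholders you would estimate from the prior and the training counts:

import org.biojava.bio.dist.Distribution;
import org.biojava.bio.symbol.AtomicSymbol;
import org.biojava.bio.symbol.IllegalSymbolException;

public class MixedOrderDistribution implements Distribution {
   /* orderDists[k] behaves as an order-(k+1) distribution over the
    * same high-order alphabet, ignoring the conditioning symbols
    * beyond its own order (e.g. via a chopping distribution like the
    * one at the end of this mail)
    */
   private Distribution[] orderDists;

   /* mixing weights, summing to 1.0, estimated from the prior over
    * order and the amount of supporting evidence
    */
   private double[] mixWeights;

   public double getWeight(AtomicSymbol as)
   throws IllegalSymbolException {
     // the mixed probability is just the weighted sum over orders
     double w = 0.0;
     for (int i = 0; i < orderDists.length; ++i) {
       w += mixWeights[i] * orderDists[i].getWeight(as);
     }
     return w;
   }

   // (the remaining Distribution methods are omitted from this sketch)
}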

Ultimately, you would want to use some integration over all state 
machines over all emission orders simultaneously, and learn a 
probability distribution over these machines that makes your data look 
likely. This is way beyond my maths, but with suitable training routines 
it ends up collapsing back into the finite state machine world.

The quick and dirty solution (if you already know which regions of a 
model are what order) is to define the entire model in the highest order 
necessary, and then make the lower order distributions degenerate. You 
can save on time/memory by writing a small Distribution implementation 
that delegates to an underlying Distribution after trimming the 
cross-product symbol appropriately. Don't forget to register two 
trainers: one for the delegate distribution, and one for your 
implementation that forwards on the counts for the trimmed symbols.

Here is some code that won't compile, but will give you strong hints. 
You may wish to check the trimming distribution code into BioJava once 
it works.

Matthew

import org.biojava.bio.dist.Distribution;
import org.biojava.bio.symbol.AtomicSymbol;
import org.biojava.bio.symbol.FiniteAlphabet;
import org.biojava.bio.symbol.IllegalSymbolException;

public class ChoppingDistribution implements Distribution {
   /* the lower-order distribution we delegate to */
   private Distribution targetDist;

   /* the (high-order) alphabet and order this distribution advertises */
   private FiniteAlphabet alpha;
   private int order;

   /* the (lower-order) alphabet and order of the delegate */
   private FiniteAlphabet targetAlpha;
   private int targetOrder;

   /* "cast" a symbol from the distribution alphabet to the
   * target alphabet
   *
   * ps should we be taking the first or last symbols?
   * check NthOrderDistribution to make it the same.
   */
   private AtomicSymbol downCast(AtomicSymbol s)
   throws IllegalSymbolException {
     return (AtomicSymbol) targetAlpha.getSymbol(
       s.getSymbols().subList(0, targetOrder)
     );
   }

   public double getWeight(AtomicSymbol as)
   throws IllegalSymbolException {
     // look up the weight of the trimmed symbol in the delegate
     return targetDist.getWeight(downCast(as));
   }

   // (the remaining Distribution methods are omitted from this sketch)
}
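
The two-trainer registration I mentioned might look roughly like this 
(another sketch: I am assuming DistributionTrainerContext methods along 
the lines of registerDistribution, registerTrainer and addCount, so 
check the real interfaces before trusting any of it):

// inside ChoppingDistribution
public void registerWithTrainer(DistributionTrainerContext dtc) {
   // make sure the delegate itself gets trained...
   dtc.registerDistribution(targetDist);
   // ...and register a trainer for us that forwards counts to the
   // delegate after trimming the cross-product symbol
   dtc.registerTrainer(this, new DistributionTrainer() {
     public void addCount(DistributionTrainerContext ctxt,
                          AtomicSymbol sym, double count)
     throws IllegalSymbolException {
       ctxt.addCount(targetDist, downCast(sym), count);
     }
     // train() and clearCounts() elided from this sketch
   });
}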