[Biojava-l] Mixing order of models
Matthew Pocock
mrp@sanger.ac.uk
Thu, 11 Oct 2001 15:14:51 +0100
Thomas Down wrote:
> On Thu, Oct 11, 2001 at 01:01:20PM +1300, Mark Schreiber wrote:
>
>>Hi -
>>
>>A while ago I asked a question about mixing the order of Markov Models.
>>The reason I want to do this is that some regions contain enough sequence
>>to make 2nd order models, while other regions can easily make 5th order models. I
>>seem to remember someone suggesting the use of redundant hexamers to mimic
>>trimers (or something), how would this work??
>>
>
> Well, it's certainly possible to mix orders of models, by using
> distributions which are high-order, but where some (or all)
> of the conditioning symbols are ignored. I guess that's equivalent
> to what you say about redundant hexamers.
>
> Training such a distribution might be a little more interesting,
> but it should be possible to write a DistributionTrainer which
> picks an appropriate order of distribution, based on the amount
> of supporting evidence. In a nice, Bayesian world, this should
> just be a question of using a sensible prior. (Matthew: am I
> missing something here?)
>
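To make the "ignored conditioning symbols" idea concrete, here is a toy sketch in plain Java (every name and number is invented for illustration; this is not BioJava API). A higher-order lookup table whose weights deliberately discard the older conditioning symbol behaves exactly like a lower-order model, which is the redundant-hexamer trick in miniature:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: a "2nd order" model that ignores the older of its two
// conditioning symbols collapses to a 1st order model.
public class RedundantOrder {
    // 1st order weights P(next | prev), keyed as "prev>next"
    static final Map<String, Double> FIRST = new HashMap<>();
    static {
        FIRST.put("a>a", 0.7); FIRST.put("a>b", 0.3);
        FIRST.put("b>a", 0.4); FIRST.put("b>b", 0.6);
    }

    // 2nd order lookup that deliberately discards the older
    // conditioning symbol, so every (older, prev) pair is redundant.
    static double secondOrder(String older, String prev, String next) {
        return FIRST.get(prev + ">" + next); // 'older' is ignored
    }

    public static void main(String[] args) {
        // Both 2nd order contexts ending in "a" give the 1st order weight.
        System.out.println(secondOrder("a", "a", "b")); // 0.3
        System.out.println(secondOrder("b", "a", "b")); // 0.3
    }
}
```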
You are correct. In a perfect world, you would choose a prior over order
(something like an exponential decay, for example), and estimate what
order each region of the model could support. For a single state you
could mix the 1st order through to nth order probabilities, using an
estimated mixing function (combined with the prior to keep it sensible)
together with estimates of the individual distributions from the
training data. You should be able to convince it to collapse down to
a single order (or even take the maximum-likelihood estimate of the
order for a state). Yuck. It wouldn't require the DP package to be
changed in any way - you'd just have to write some Distribution
instances.
Ultimately, you would want to integrate over all state machines and all
emission orders simultaneously, and learn a probability distribution
over these machines that makes your data look likely. This is way beyond
my maths, but with suitable training routines it ends up collapsing back
into the finite state machine world.
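A toy sketch of that order-mixing, in plain Java (the decay constant and the per-order estimates are invented numbers, and this is not a BioJava Distribution): the mixed weight is the sum over orders k of pi_k * P_k(next | last k symbols), where pi_k follows a renormalised exponential decay.

```java
// Toy sketch of mixing orders under an exponential-decay prior.
public class OrderMixture {
    // pi_k proportional to decay^k, renormalised to sum to 1
    static double[] prior(int maxOrder, double decay) {
        double[] pi = new double[maxOrder + 1];
        double total = 0.0;
        for (int k = 0; k <= maxOrder; k++) {
            pi[k] = Math.pow(decay, k);
            total += pi[k];
        }
        for (int k = 0; k <= maxOrder; k++) pi[k] /= total;
        return pi;
    }

    // perOrder[k] is the k-th order estimate of P(next | context);
    // the mixed weight is their prior-weighted average.
    static double mixedWeight(double[] perOrder, double[] pi) {
        double w = 0.0;
        for (int k = 0; k < perOrder.length; k++) w += pi[k] * perOrder[k];
        return w;
    }

    public static void main(String[] args) {
        double[] pi = prior(2, 0.5);       // {4/7, 2/7, 1/7}
        double[] est = {0.25, 0.30, 0.90}; // 0th..2nd order estimates
        System.out.println(mixedWeight(est, pi));
    }
}
```

A sharply decaying prior keeps the mixture close to the low-order estimates unless the training data pulls weight toward the higher orders, which is the "collapse down to a single order" behaviour described above.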
The quick and dirty solution (if you already know which regions of the
model are what order) is to define the entire model at the highest order
necessary, and then make the lower-order distributions degenerate. You
can save time/memory by writing a small Distribution implementation
that delegates to an underlying Distribution after trimming the
cross-product symbol appropriately. Don't forget to register two
trainers: one for the delegate distribution, and one for your
implementation that forwards the counts for the trimmed symbols.
Here is some code that won't compile, but will give you strong hints.
You may wish to check the trimming distribution code into BioJava once
it works.
Matthew
public class ChoppingDistribution implements Distribution {
  private Distribution targetDist;    // lower-order delegate
  private FiniteAlphabet alpha;       // cross-product alphabet of this order
  private int order;
  private FiniteAlphabet targetAlpha; // cross-product alphabet of the target order
  private int targetOrder;

  /* "cast" a symbol from the distribution alphabet to the
   * target alphabet by trimming the cross-product symbol
   *
   * ps should we be taking the first or last symbols?
   * check NthOrderDistribution to make it the same.
   */
  private AtomicSymbol downCast(AtomicSymbol s)
  throws IllegalSymbolException {
    return (AtomicSymbol) targetAlpha.getSymbol(
      s.getSymbols().subList(0, targetOrder)
    );
  }

  public double getWeight(AtomicSymbol as)
  throws IllegalSymbolException {
    return targetDist.getWeight(downCast(as));
  }
}
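For the "two trainers" part, here is a similarly hypothetical sketch in plain Java, with Strings standing in for cross-product symbols (the real thing would implement the BioJava trainer interfaces): counts observed on the high-order symbol are forwarded to the delegate under the trimmed symbol, mirroring what downCast does at query time.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a trainer that forwards counts on trimmed symbols.
public class ForwardingTrainer {
    private final Map<String, Double> delegateCounts = new HashMap<>();
    private final int targetOrder;

    ForwardingTrainer(int targetOrder) { this.targetOrder = targetOrder; }

    // Trim the symbol the same way the distribution does, then
    // accumulate the count against the trimmed symbol in the delegate.
    void addCount(String symbol, double count) {
        String trimmed = symbol.substring(0, targetOrder);
        delegateCounts.merge(trimmed, count, Double::sum);
    }

    double countFor(String trimmed) {
        return delegateCounts.getOrDefault(trimmed, 0.0);
    }

    public static void main(String[] args) {
        ForwardingTrainer t = new ForwardingTrainer(3);
        t.addCount("acgtgc", 2.0); // hexamer counted as trimer "acg"
        t.addCount("acgaat", 1.0); // also counted as "acg"
        System.out.println(t.countFor("acg")); // 3.0
    }
}
```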