[Biojava-dev] Re: [Biojava-l] position weight matrix

Fri Sep 19 05:47:21 EDT 2003

Schreiber, Mark wrote:

>Not wanting to argue but i'd love to hear why :)
> 
>- Mark
>  
>
Ok - I'll try to give a coherent explanation, with each point in no 
particular order. Sometimes I wish my maths was better.

To a biologist, an n means that any one of the four nucleotides could be 
present, a y means that one of two (c or t) could be present, and so on. 
To a statistician, I guess the natural thing is to fit some probability 
or expectation (depending on whether you are classical or Bayesian) to 
those possibilities.

HMMs are generative models. We use probabilistic HMMs for modeling the 
sequences, but really what we are doing is comparing our sequences to all 
of those that the HMM could generate, and getting the joint probability 
that the HMM made a sequence like that, that we had a sequence like that 
in the first place, and that we had that HMM (which is where all those 
pesky priors and posteriors come from).

If we like the idea of generative grammars, we can treat a sequence 
containing ambiguity symbols as a generative model. We could, if we 
wish, iterate over all possible matching sequences composed entirely of 
atomic symbols. So:

  antt

can be expanded to the four sequences

  aatt
  actt
  agtt
  attt

In fact, this is exactly what we do for some of the sequence searching 
objects that provide regular-expression functionality.
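
For illustration only, here is a minimal sketch of that expansion. It is 
written in Python rather than against the BioJava API, and the names 
IUPAC and expand are made up for this example; only the ambiguity table 
itself is the standard IUPAC one.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes, lower case to match the
# examples above. Each symbol maps to the atomic symbols it can stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at",
    "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg",
    "n": "acgt",
}


def expand(seq):
    """Yield every sequence of atomic symbols that `seq` can generate."""
    choices = [IUPAC[symbol] for symbol in seq.lower()]
    for combo in product(*choices):
        yield "".join(combo)


print(list(expand("antt")))  # ['aatt', 'actt', 'agtt', 'attt']
```

The number of expansions is the product of the sizes of the ambiguity 
sets, so this is only practical for sequences with a few ambiguous 
positions.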

When dealing with the distribution objects, the probability of observing 
one of a, g, c or t is going to be 1 - we must observe something, and 
these are the only possibilities. This means that an n is uninformative 
when we compare the likelihood of one distribution against another - 
they will both produce 1. However, if we look at the log odds for this - 
dividing out the null model, which also gives 1, and taking the log - we 
get the number 0. The log-odds scores for the simple symbols will be 
rather more interesting - positive when the distribution fits the data 
better, and negative when the null model fits it better. Here, the 0 
value is doing something useful - it's just saying that the symbol is 
uninformative, giving support to neither the model nor the null model.
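
A small numeric sketch of the point, again in plain Python and not 
BioJava code. It assumes that the probability of an ambiguity symbol is 
the sum of the probabilities of the atomic symbols it covers; the 
example distribution, the uniform null model and the function names are 
all invented for illustration.

```python
import math

# Atomic symbols covered by each symbol (a subset of the IUPAC table).
ATOMS = {"a": "a", "c": "c", "g": "g", "t": "t", "y": "ct", "n": "acgt"}

# Hypothetical model distribution and uniform null model. The values are
# exact binary fractions so that each distribution sums to exactly 1.
model = {"a": 0.625, "c": 0.125, "g": 0.125, "t": 0.125}
null = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def symbol_probability(dist, symbol):
    """Probability of a symbol: sum over the atomic symbols it covers."""
    return sum(dist[atom] for atom in ATOMS[symbol])


def log_odds(dist, null_dist, symbol):
    """Log (base 2) of model probability divided by null probability."""
    return math.log2(
        symbol_probability(dist, symbol) / symbol_probability(null_dist, symbol)
    )


for symbol in "acyn":
    print(symbol, log_odds(model, null, symbol))
```

With these numbers a scores above zero, c scores below zero, and n 
scores zero, because both the model and the null model assign it 
probability 1.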

So - to cut a long story short: our distribution objects are really PDFs 
over sets of symbols, HMMs are PDFs over sequences, and we can use log 
odds, taking a null model into account, to turn ambiguities into sane 
numbers. Given that, it is easier to make sequences containing ambiguity 
symbols behave as generative grammars, and to sum over all sequences 
generated by these grammars for all HMM-related math (including HMMs and 
distributions). This way we have one world view, and all the sums work 
out without fudge-factor code being shot-gunned across the project.
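
To make "sum over all generated sequences" concrete, here is a sketch 
for the simplest case, a weight matrix with independent columns. It is 
plain Python, not BioJava code; the matrix values and function names are 
hypothetical. For independent columns the brute-force sum over 
expansions equals the product of per-column sums, so the expansion never 
has to be enumerated explicitly. A full HMM would need the equivalent 
sum folded into its own recursions, which this sketch does not cover.

```python
from itertools import product
from math import isclose, prod

# Atomic symbols covered by each symbol (a subset of the IUPAC table).
ATOMS = {"a": "a", "c": "c", "g": "g", "t": "t", "y": "ct", "n": "acgt"}

# Hypothetical 4-column weight matrix: one distribution per position.
pwm = [
    {"a": 0.625, "c": 0.125, "g": 0.125, "t": 0.125},
    {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    {"a": 0.125, "c": 0.125, "g": 0.125, "t": 0.625},
    {"a": 0.125, "c": 0.5, "g": 0.125, "t": 0.25},
]


def expand(seq):
    """Yield every atomic sequence that the ambiguous `seq` generates."""
    for combo in product(*(ATOMS[s] for s in seq)):
        yield "".join(combo)


def atomic_likelihood(matrix, seq):
    """Probability of an atomic sequence: product over columns."""
    return prod(col[s] for col, s in zip(matrix, seq))


def summed_likelihood(matrix, seq):
    """Brute force: sum the likelihood of every generated sequence."""
    return sum(atomic_likelihood(matrix, s) for s in expand(seq))


def factorised_likelihood(matrix, seq):
    """Same quantity, computed column by column without expanding."""
    return prod(sum(col[a] for a in ATOMS[s]) for col, s in zip(matrix, seq))


print(summed_likelihood(pwm, "anty"), factorised_likelihood(pwm, "anty"))
print(isclose(summed_likelihood(pwm, "nnnn"), 1.0))
```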

Is that clear, or have I garbled it again?

Matthew