[Biojava-dev] Re: [Biojava-l] position weight matrix
Matthew Pocock
matthew_pocock at yahoo.co.uk
Fri Sep 19 05:47:21 EDT 2003
Schreiber, Mark wrote:
>Not wanting to argue but i'd love to hear why :)
>
>- Mark
>
>
Ok - I'll try to give a coherent explanation, with each point in no
particular order. Sometimes I wish my maths was better.
To a biologist, an n means that any one of the four nucleotides could be
present, a y means that one of two could be present, and so on. A
statistician, I guess, would want to fit some probability or
expectation (depending on classical or Bayesian) to those possibilities.
HMMs are generative models. We use probabilistic HMMs for modeling the
sequences, but really what we are doing is comparing the sequences to all
of those that the HMM generates, and getting the joint probability that
the HMM made a sequence like that, that we had a sequence like that
in the first place, and that we had that HMM (which is where all those
pesky priors and posteriors come from).
If we like the idea of generative grammars, we can treat a sequence
containing ambiguity symbols as a generative model. We could, if we wish,
iterate over all possible matching sequences composed entirely of
atomic symbols. So:
antt
can be expanded to the four sequences
aatt
actt
agtt
attt
In fact, this is exactly what we do for some of the sequence searching
objects that provide regular-expression functionality.
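A minimal sketch of that expansion, in Python for brevity. It is purely
illustrative and is not BioJava's actual code: the IUPAC table is the
standard nucleotide ambiguity alphabet, and the function name expand is
made up for this example.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes, lower case as in the message.
# Each symbol maps to the atomic symbols it can stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg",
    "n": "acgt",
}


def expand(seq):
    """Yield every atomic sequence matched by seq (hypothetical helper)."""
    # One list of alternatives per position, then the Cartesian product.
    choices = [IUPAC[symbol] for symbol in seq.lower()]
    for combo in product(*choices):
        yield "".join(combo)


print(list(expand("antt")))  # ['aatt', 'actt', 'agtt', 'attt']
```

The number of expansions is the product of the alternatives at each
position, so it grows quickly: "nn" already gives 16 sequences.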
When dealing with the distribution objects, the probability of observing
one of a, g, c or t is going to be 1 - we must observe something, and
these are the only possibilities. This means that an n is uninformative
when comparing the likelihood of one distribution against
another - they will both produce 1. However, if we look at the log odds
for this - dividing out the null model, which also gives 1 - we get
log(1/1) = 0. The log-odds scores for the simple symbols will be rather
more interesting - positive when the distribution fits the data better,
and negative when the null model fits it better. Here, the 0 value is
doing something useful - it's just saying that that symbol is
uninformative - neither giving support to the model nor to the null
model.
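To make the arithmetic concrete, here is a small sketch under stated
assumptions: an ambiguity symbol is scored by summing the probabilities
of the atomic symbols it matches, under both the model and the null
model, and taking the log of the ratio. The distributions and the
function name log_odds are invented for illustration and are not
BioJava's Distribution API.

```python
import math

# Standard IUPAC nucleotide ambiguity codes (subset is enough here).
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "n": "acgt",
}

# Made-up example distributions; each sums to 1.
model = {"a": 0.625, "c": 0.125, "g": 0.125, "t": 0.125}
null = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def log_odds(symbol, model, null):
    """Log2 odds of a (possibly ambiguous) symbol, model versus null."""
    members = IUPAC[symbol.lower()]
    p_model = sum(model[s] for s in members)  # mass the model gives the set
    p_null = sum(null[s] for s in members)    # mass the null gives the set
    return math.log2(p_model / p_null)


print(log_odds("n", model, null))  # 0.0: both sums are 1, so uninformative
print(log_odds("a", model, null))  # positive: the model favours a
print(log_odds("t", model, null))  # -1.0: the null model fits t better
```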
So - to cut a long story short: our distribution objects are really
PDFs over sets of symbols, HMMs are PDFs over sequences, and we can
use log odds, taking a null model into account, to turn ambiguities
into sane numbers. It is therefore easier to make sequences containing
ambiguity symbols behave as generative grammars, and to sum over all
sequences generated by these grammars for all HMM-related math
(including HMMs and distributions). This way we have one world view, and
all the sums work out without fudge-factor code being shot-gunned across
the project.
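As a rough illustration of "summing over all generated sequences", the
sketch below assumes the simplest case, a weight matrix whose columns
are independent distributions. Under that assumption the sum over every
expansion equals the product of per-column sums, so nothing needs to be
enumerated. The matrix values and all function names are hypothetical,
not BioJava code.

```python
import math
from itertools import product

IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "n": "acgt",
}

# Made-up 4-column weight matrix; each column is a distribution.
columns = [
    {"a": 0.625, "c": 0.125, "g": 0.125, "t": 0.125},
    {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    {"a": 0.125, "c": 0.125, "g": 0.125, "t": 0.625},
    {"a": 0.125, "c": 0.125, "g": 0.125, "t": 0.625},
]
null = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def prob_by_enumeration(seq, columns):
    """Sum the matrix probability over every atomic expansion of seq."""
    total = 0.0
    for atomic in product(*(IUPAC[s] for s in seq.lower())):
        p = 1.0
        for symbol, column in zip(atomic, columns):
            p *= column[symbol]
        total += p
    return total


def prob_by_column(seq, columns):
    """Same quantity, computed as a product of per-column sums."""
    p = 1.0
    for symbol, column in zip(seq.lower(), columns):
        p *= sum(column[s] for s in IUPAC[symbol])
    return p


def log_odds(seq, columns, null):
    """Log2 odds of seq under the matrix versus a per-position null."""
    null_columns = [null] * len(seq)
    return math.log2(prob_by_column(seq, columns)
                     / prob_by_column(seq, null_columns))


# Both routes agree, and an all-n sequence scores 0.
print(prob_by_enumeration("antt", columns), prob_by_column("antt", columns))
print(log_odds("nnnn", columns, null))
```

For an HMM with real state transitions the factorisation is not this
simple, but the same per-symbol sums could in principle be used for the
emission terms inside the usual forward recursion.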
Is that clear, or have I garbled it again?
Matthew