[Biojava-l] High Order HMM

Sat Dec 23 12:08:58 UTC 2006

> Hi,
>
> New to HMMs and BioJava, so what I'm asking for is probably a dumb question.
> But I figure it better to ask it rather than sit here and be puzzled...
>
> From the wiki article
> http://www.biojava.org/wiki/BioJava:Tutorial:Dynamic_programming_examples
> and the post http://portal.open-bio.org/pipermail/biojava-l/2006-March/005387.html
>
> I get the sense that in order to create a third-order HMM, reading a
> protein sequence, and emitting symbols (e.g. create an alphabet
> TriGreek from "alpha","beta","delta"), you would need to create one
> state for each amino acid, and associate each state with an
> OrderNDistribution using a cross product alphabet as in
> AlphabetManager.generateCrossProductAlphaFromName("(Protein x Protein
> x TriGreek)").
>
> So if you walked through a trimer AGF which emitted "alpha", you would
> end in the state "F", which uses an OrderNDistribution where the first
> protein (in the cross product alphabet) corresponds to the "A", the
> second protein corresponds to the "G", and the last term corresponds
> to "alpha."
>

Your problem sounds like you are trying to estimate observations of
amino acid delta based on the previous 2 observations (a second order
model). Thus you would use an OrderNDistribution in which p(Delta) is
conditioned on (Protein x Protein).
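
To make the conditioning concrete, here is a minimal plain-Java sketch of
the idea. It is not the BioJava API: the class name OrderTwoTable and its
methods are made up for illustration, and it only shows that an order-2
emission is one probability row per two-residue context.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only: P(next | previous two residues). */
final class OrderTwoTable {
    // Outer key: two-residue context such as "AG".
    // Inner map: emitted symbol -> probability given that context.
    private final Map<String, Map<Character, Double>> table = new HashMap<>();

    /** Store P(next | prev2, prev1). */
    void set(char prev2, char prev1, char next, double p) {
        String context = "" + prev2 + prev1;
        table.computeIfAbsent(context, k -> new HashMap<>()).put(next, p);
    }

    /** Look up P(next | prev2, prev1); unseen entries are treated as 0. */
    double get(char prev2, char prev1, char next) {
        Map<Character, Double> row = table.get("" + prev2 + prev1);
        if (row == null) {
            return 0.0;
        }
        Double p = row.get(next);
        return p == null ? 0.0 : p;
    }
}
```

With this layout, walking through A, G, F means looking up the row for
the context "AG" and reading off the entry for "F".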

> This seems odd, so what I don't get, is should I be mixing emissions
> with previous states in the cross product alphabet to create a third
> order HMM? Or is there a better way?

An alternative would be to have your states emit 3 amino acids at
once. This would be a normal Distribution over the alphabet
(Protein x Protein x Protein). Each amino acid triple would be completely
independent of the previous triple. This is not the same as the
OrderNDistribution, which emits single amino acids based on the previous
two.
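
The difference between the two approaches is easiest to see by listing
which lookups each one makes for the same sequence. The following is a
plain-Java sketch with hypothetical names (LookupPatterns, tripleBlocks,
orderTwoLookups); it does not use BioJava and only illustrates the
non-overlapping versus overlapping pattern.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only: which entries each model consults. */
final class LookupPatterns {
    /**
     * Triple-emitting state: the sequence is cut into non-overlapping
     * blocks of 3 and each block is scored on its own. Residues left
     * over at the end are ignored in this sketch.
     */
    static List<String> tripleBlocks(String seq) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            out.add(seq.substring(i, i + 3));
        }
        return out;
    }

    /**
     * Order-2 emission: every residue from the third onward is scored
     * given the two residues immediately before it, so contexts overlap.
     */
    static List<String> orderTwoLookups(String seq) {
        List<String> out = new ArrayList<>();
        for (int i = 2; i < seq.length(); i++) {
            out.add(seq.substring(i - 2, i) + "->" + seq.charAt(i));
        }
        return out;
    }
}
```

For "AGFKLM" the triple model consults AGF and KLM, while the order-2
model consults AG->F, GF->K, FK->L and KL->M.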

>
> I'm even more confused about how to define transition weights.
>

Each state contains a Distribution of States. These states are from
the Alphabet of States that the state is connected to. The State
classes implement Symbol, so they can belong to Alphabets. The Distribution
of States gives the probability of transitioning to each State in the
Alphabet of States that the origin state connects to.

If your model is fully ergodic, each state connects to every other
state, so the transition Alphabet contains every other state (in fact,
in fully ergodic models states can connect to themselves, so the
transition Alphabet would include all states, including the Magic
state). If your model has a more complex architecture, then the
transition Alphabet will include only the states you can transition
to.
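
As a rough picture of the above, each state owns one row of transition
probabilities over the states it can reach. The sketch below is plain
Java with made-up names (TransitionSketch, fullyConnected) rather than
BioJava calls, and the uniform weights are an assumption chosen only to
keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only: one transition row per origin state. */
final class TransitionSketch {
    /**
     * Fully connected case: every state can move to every state,
     * itself included, with equal probability.
     */
    static Map<String, Map<String, Double>> fullyConnected(List<String> states) {
        Map<String, Map<String, Double>> out = new LinkedHashMap<>();
        for (String from : states) {
            Map<String, Double> row = new LinkedHashMap<>();
            for (String to : states) {
                row.put(to, 1.0 / states.size());
            }
            out.put(from, row);
        }
        return out;
    }
}
```

A more restricted architecture would simply leave unreachable states out
of a row, with the remaining entries in that row still summing to 1.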

Hope this helps.

> Obviously, I'm wrong about something...  How do you define
> states/distributions in a third order HMM?
>
> Thanks