[Bioperl-l] Hidden Markov Model in Bioperl?

Yee Man Chan ymc at paxil.stanford.edu
Mon Mar 28 13:14:56 EST 2005



On Mon, 28 Mar 2005, Aaron J. Mackey wrote:

> Yes, in bioperl-ext, of course ...

Yes, my intention was to add it to bioperl-ext.

> 
> On Mar 25, 2005, at 6:49 PM, Yee Man Chan wrote:
> 
> > 	I am thinking of an interface like this:
> >
> > Bio::Tools::HMM->new("symbols", "states")
> > - instantiate an HMM object with a string of symbols (each character
> > corresponds to one symbol) and a string of states. Other parameters
> > of the model are generated randomly. Good for starting a Baum-Welch
> > training.
> 
> Why not expand this to be two arrayrefs of symbols or states?  You can 
> convert them into whatever encoded single-char alphabet you'd like.  
> Think Perl, not C.  This is a feature request, not a requirement, of 
> course.

I thought about that too. But since this is an HMM for Bioperl, I don't
see much usage outside DNA and protein sequences, so maybe strings are
OK? It would also be quite tedious if users had to convert a DNA string
to an array of characters just to use the HMM. Can you give me some
biological examples that would justify this feature request?
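That said, if it turns out people want it, accepting either form should
be cheap. A hypothetical sketch (the helper below is not part of the
proposed interface, just an illustration of the coercion):

# Hypothetical helper, for illustration only: accept either a string
# ("ACGT") or an arrayref (['A','C','G','T']) of symbols and normalize
# to the internal one-character-per-symbol string form.
sub _coerce_alphabet {
    my ($symbols) = @_;
    return join('', @$symbols) if ref($symbols) eq 'ARRAY';
    return $symbols;
}

# Both calls would then be equivalent:
# Bio::Tools::HMM->new("ACGT", "ME");
# Bio::Tools::HMM->new(['A','C','G','T'], ['M','E']);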

> 
> > Bio::Tools::HMM->ObsSeqProb("string of observed sequence")
> > - return the probability of an observed sequence.
> 
> This is the Forward algorithm P()?  Perhaps an alias to Forward(), and 
> the ability to specify an offset/index at which you want the Forward 
> value (see below)?  Or is this the product of Viterbi factors?
> 

This is P(O|lambda), i.e. given an HMM model and an observed sequence,
the probability of seeing that observed sequence. It is equivalent to
sum_{i=1}^{N} alpha_T(i), where alpha is the forward variable, T is the
length of the observed sequence and N is the number of hidden states.

Forward and Backward functions are hidden from this interface for now.
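For concreteness, this is the kind of forward recursion I mean; a
minimal sketch in plain Perl, with illustrative parameter names rather
than the actual bioperl-ext internals:

# Minimal forward-algorithm sketch (illustrative only).
# $pi->[$i]       initial probability of state $i
# $A->[$i][$j]    transition probability from state $i to state $j
# $B->[$i]{$sym}  probability that state $i emits symbol $sym
sub forward_prob {
    my ($pi, $A, $B, @obs) = @_;
    my $N = scalar @$pi;
    # alpha_1(i) = pi_i * b_i(O_1)
    my @alpha = map { $pi->[$_] * $B->[$_]{$obs[0]} } 0 .. $N-1;
    for my $t (1 .. $#obs) {
        my @next;
        for my $j (0 .. $N-1) {
            my $sum = 0;
            $sum += $alpha[$_] * $A->[$_][$j] for 0 .. $N-1;
            $next[$j] = $sum * $B->[$j]{$obs[$t]};
        }
        @alpha = @next;
    }
    my $p = 0;
    $p += $_ for @alpha;    # P(O|lambda) = sum_{i=1}^{N} alpha_T(i)
    return $p;
}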

Oh, should I return this as log(P)? For a sequence of just a couple
hundred symbols, P tends to be very close to zero, so maybe log(P) would
make more sense to users?
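The standard fix is Rabiner-style per-step scaling, which yields log P
directly; a log-space variant of the sketch above (again illustrative,
not the actual implementation):

# Same recursion, but rescale alpha at every step and accumulate
# the log of the scale factors; log P(O|lambda) = sum_t log(c_t).
sub log_forward_prob {
    my ($pi, $A, $B, @obs) = @_;
    my $N = scalar @$pi;
    my @alpha = map { $pi->[$_] * $B->[$_]{$obs[0]} } 0 .. $N-1;
    my $logp = 0;
    for my $t (0 .. $#obs) {
        if ($t > 0) {
            my @next;
            for my $j (0 .. $N-1) {
                my $sum = 0;
                $sum += $alpha[$_] * $A->[$_][$j] for 0 .. $N-1;
                $next[$j] = $sum * $B->[$j]{$obs[$t]};
            }
            @alpha = @next;
        }
        my $c = 0;
        $c += $_ for @alpha;             # scale factor c_t
        @alpha = map { $_ / $c } @alpha;
        $logp += log($c);
    }
    return $logp;
}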

> > Bio::Tools::HMM->Viterbi("string of observed sequence")
> > - return the string of hidden states that maximizes the probability
> > of the observed sequence.
> 
> this might also return the P() of the Viterbi path; and again, instead 
> of returning a string of symbols, an arrayref of symbols.
> 

Based on my understanding of the literature, I don't recall seeing any
effort to compute the probability of the hidden state sequence.
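For concreteness, here is the recurrence I would be implementing; a
minimal sketch with the same illustrative parameter conventions as the
forward sketch above. The max over delta at the final step is in fact
the probability of the returned path, so exposing it would be cheap if
people want it:

# Minimal Viterbi sketch (illustrative only). Returns the most
# probable state path and, as a byproduct, its probability
# max_i delta_T(i).
sub viterbi {
    my ($pi, $A, $B, @obs) = @_;
    my $N = scalar @$pi;
    my @delta = map { $pi->[$_] * $B->[$_]{$obs[0]} } 0 .. $N-1;
    my @psi;
    for my $t (1 .. $#obs) {
        my @next;
        for my $j (0 .. $N-1) {
            my ($best, $arg) = (-1, 0);
            for my $i (0 .. $N-1) {
                my $v = $delta[$i] * $A->[$i][$j];
                ($best, $arg) = ($v, $i) if $v > $best;
            }
            $next[$j] = $best * $B->[$j]{$obs[$t]};
            $psi[$t][$j] = $arg;
        }
        @delta = @next;
    }
    # best final state, then backtrace through psi
    my ($p, $q) = (-1, 0);
    for my $i (0 .. $N-1) {
        ($p, $q) = ($delta[$i], $i) if $delta[$i] > $p;
    }
    my @path = ($q);
    for (my $t = $#obs; $t > 0; $t--) {
        $q = $psi[$t][$q];
        unshift @path, $q;
    }
    return (\@path, $p);
}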

> > Bio::Tools::HMM->getInitArray()
> > Bio::Tools::HMM->getStateMatrix()
> > Bio::Tools::HMM->getEmissionMatrix()
> 
> Presumably these should be get/set methods?
> 

Yeah, I should make them combined get and set methods.
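In Bioperl these would probably follow the usual combined get/set
accessor convention, e.g. something like this (method and slot names
hypothetical):

# Hypothetical combined get/set accessor in the usual Bioperl style.
sub init_array {
    my ($self, $value) = @_;
    $self->{'_init_array'} = $value if defined $value;
    return $self->{'_init_array'};
}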

> What's missing is 1) posterior decoding and 2) partial path probability 
> (i.e. F_i * v_{i+1} * v_{i+2} * ... * v_{j-1} * B_j / F_x, where i < j, 
> F and B are Forward and Backward values, and the v's are Viterbi factors 
> for each step in the partial path specified from i to j)
> 

I can add posterior_decoding but I am not sure what
partial_path_probability is. Can you give me a link to some information 
about it?
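Posterior decoding itself I can do from the forward and backward values:
gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda). A minimal sketch,
assuming full alpha and beta matrices are kept rather than just the
final vectors:

# Posterior decoding sketch (illustrative): gamma_t(i) is the
# probability of being in state i at time t given the whole
# observed sequence.
sub posterior {
    my ($alpha, $beta, $p_obs) = @_;   # $alpha->[$t][$i], $beta->[$t][$i]
    my @gamma;
    for my $t (0 .. $#$alpha) {
        for my $i (0 .. $#{ $alpha->[$t] }) {
            $gamma[$t][$i] = $alpha->[$t][$i] * $beta->[$t][$i] / $p_obs;
        }
    }
    return \@gamma;
}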

> I'd also prefer lower case names (BaumWelch could just be called 
> "train" or "learn_unsupervised" or somesuch)

I have two ways to train the HMM: one without the hidden state sequence
supplied (i.e. BaumWelchTraining) and one with the hidden state sequence
(i.e. StatisticalTraining). Is the former learn_unsupervised and the
latter learn_supervised in AI speak?
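In usage the two would look something like this (method names still
provisional):

# Provisional usage of the two training modes; names not final.
my $hmm = Bio::Tools::HMM->new("ACGT", "ME");

# unsupervised: estimate parameters from observed sequences alone
$hmm->baum_welch_training(\@observed_seqs);

# supervised: observed sequences paired with known hidden paths
$hmm->statistical_training(\@observed_seqs, \@hidden_state_seqs);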

Regards,
Yee Man

> 
> Also, see the HMM functions available in Matlab that do the same ...
> 
> Good luck,
> 
> -Aaron
> 
> --
> Aaron J. Mackey, Ph.D.
> Dept. of Biology, Goddard 212
> University of Pennsylvania       email:  amackey at pcbi.upenn.edu
> 415 S. University Avenue         office: 215-898-1205
> Philadelphia, PA  19104-6017     fax:    215-746-6697
> 


