[Bioperl-l] Hidden Markov Model in Bioperl?
Yee Man Chan
ymc at paxil.stanford.edu
Mon Mar 28 13:14:56 EST 2005
On Mon, 28 Mar 2005, Aaron J. Mackey wrote:
> Yes, in bioperl-ext, of course ...
Yes, adding it to bioperl-ext was my intention.
>
> On Mar 25, 2005, at 6:49 PM, Yee Man Chan wrote:
>
> > I am thinking of an interface like this:
> >
> > Bio::Tools::HMM->new("symbols", "states")
> > - instantiate an HMM object with a string of symbols (each character
> > corresponds to one symbol) and a string of states. The other
> > parameters of the model are generated randomly. Good for starting
> > Baum-Welch training.
>
> Why not expand this to be two arrayrefs of symbols or states? You can
> convert them into whatever encoded single-char alphabet you'd like.
> Think Perl, not C. This is a feature request, not a requirement, of
> course.
I thought about that too, but since this is an HMM for Bioperl, I don't
see any usage outside DNA and protein sequences, so maybe strings are
OK? It would be quite tedious to have to convert a DNA string into an
array of characters just to use the HMM. Can you give me a biological
example that would justify this feature request?
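To make it concrete, here is the kind of usage I am picturing with the
string-based interface (the 'e'/'i' state labels for exon/intron are
just an example; none of this is committed API yet):

use Bio::Tools::HMM;   # proposed module, to live in bioperl-ext

# Symbol alphabet is the DNA alphabet; states are, say, exon/intron.
my $hmm = Bio::Tools::HMM->new("ACGT", "ei");

# The observed sequence stays a plain string over the symbol alphabet,
# so there is no need to split it into an array of characters first.
my $obs = "ACCGTTAACGTTA";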
>
> > Bio::Tools::HMM->ObsSeqProb("string of observed sequence")
> > - return the probability of an observed sequence.
>
> This is the Forward algorithm's P()? Perhaps alias it to Forward(),
> with the ability to specify an offset/index at which you want the
> Forward value (see below)? Or is this the product of Viterbi factors?
>
This is P(O|lambda), i.e. given an HMM model and an observed sequence,
the probability of seeing that observed sequence. It is equal to
sum_{i=1..N} alpha_T(i), where alpha is the forward variable, T is the
length of the observed sequence and N is the number of hidden states.
The Forward and Backward functions themselves are hidden from this
interface for now.
Oh, should I return this as log(P)? For a sequence of just a couple
hundred symbols, P tends to be very close to zero, so maybe log(P) will
make more sense to users?
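To show the underflow concretely, here is a minimal plain-Perl sketch of
the computation in log space (illustration only, not the C code that
would actually back bioperl-ext; $pi, $a and $b are assumed to hold log
probabilities):

use List::Util qw(max);

# log(sum_i exp(x_i)), computed stably.
sub logsumexp {
    my @x = @_;
    my $m = max(@x);
    my $s = 0;
    $s += exp($_ - $m) for @x;
    return $m + log($s);
}

# Forward algorithm in log space.
#   $pi->[$i]      log initial probability of state i
#   $a->[$i][$j]   log transition probability i -> j
#   $b->[$i]{$sym} log emission probability of symbol sym in state i
sub log_forward {
    my ($pi, $a, $b, @obs) = @_;
    my $N = scalar @$pi;
    my @alpha = map { $pi->[$_] + $b->[$_]{$obs[0]} } 0 .. $N - 1;
    for my $t (1 .. $#obs) {
        my @next;
        for my $j (0 .. $N - 1) {
            $next[$j] = logsumexp(map { $alpha[$_] + $a->[$_][$j] } 0 .. $N - 1)
                        + $b->[$j]{$obs[$t]};
        }
        @alpha = @next;
    }
    return logsumexp(@alpha);   # log P(O|lambda) = log sum_i alpha_T(i)
}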
> > Bio::Tools::HMM->Viterbi("string of observed sequence")
> > - return the string of hidden states that maximizes the probability
> > of the observed sequence.
>
> this might also return the P() of the Viterbi path; and again, instead
> of returning a string of symbols, an arrayref of symbols.
>
Based on my understanding of the literature, I don't recall seeing any
effort to compute the probability of the hidden state sequence.
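If what you mean is just the delta value that the Viterbi recursion
already maximizes, that would be cheap to return alongside the path. A
rough log-space sketch in plain Perl (illustration only, same
conventions as the Forward sketch above):

# Viterbi in log space; returns the best state path and its log P.
sub log_viterbi {
    my ($pi, $a, $b, @obs) = @_;
    my $N = scalar @$pi;
    my @delta = map { $pi->[$_] + $b->[$_]{$obs[0]} } 0 .. $N - 1;
    my @psi;   # back-pointers
    for my $t (1 .. $#obs) {
        my @next;
        for my $j (0 .. $N - 1) {
            my ($best, $arg) = ($delta[0] + $a->[0][$j], 0);
            for my $i (1 .. $N - 1) {
                my $v = $delta[$i] + $a->[$i][$j];
                ($best, $arg) = ($v, $i) if $v > $best;
            }
            $next[$j]    = $best + $b->[$j]{$obs[$t]};
            $psi[$t][$j] = $arg;
        }
        @delta = @next;
    }
    # trace back from the best final state
    my ($logp, $state) = ($delta[0], 0);
    for my $i (1 .. $N - 1) {
        ($logp, $state) = ($delta[$i], $i) if $delta[$i] > $logp;
    }
    my @path = ($state);
    for (my $t = $#obs; $t >= 1; $t--) {
        $state = $psi[$t][$state];
        unshift @path, $state;
    }
    return (\@path, $logp);
}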
> > Bio::Tools::HMM->getInitArray()
> > Bio::Tools::HMM->getStateMatrix()
> > Bio::Tools::HMM->getEmissionMatrix()
>
> Presumably these should be get/set methods?
>
Yeah. I should do both get and set.
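So in use it would look something like this (the set forms are my
assumption):

my $pi = $hmm->getInitArray;        # initial state probabilities
my $A  = $hmm->getStateMatrix;      # state transition probabilities
my $B  = $hmm->getEmissionMatrix;   # emission probabilities
$hmm->setInitArray($pi);            # and the matching setters
$hmm->setStateMatrix($A);
$hmm->setEmissionMatrix($B);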
> What's missing is 1) posterior decoding and 2) partial path probability
> (i.e. F_{i} * v_{i+1} * v_{i+2} * ... * v_{j-1} * B_{j} / F_{x}, where
> i < j, F and B are Forward and Backward values, and the v's are Viterbi
> factors for each step in the partial path specified from i to j)
>
I can add posterior_decoding, but I am not sure what
partial_path_probability is. Can you give me a link to some information
about it?
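For posterior decoding I have in mind the usual gamma quantity,
gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda). A plain-Perl sketch
from precomputed log-space Forward/Backward tables (illustration only,
reusing logsumexp from the Forward sketch above):

# Posterior decoding from log-space alpha/beta tables indexed
# [time][state]; returns gamma as plain probabilities.
sub posterior {
    my ($alpha, $beta) = @_;
    my $T    = $#$alpha;
    my $logp = logsumexp(@{ $alpha->[$T] });   # log P(O|lambda)
    my @gamma;
    for my $t (0 .. $T) {
        for my $i (0 .. $#{ $alpha->[$t] }) {
            $gamma[$t][$i] = exp($alpha->[$t][$i] + $beta->[$t][$i] - $logp);
        }
    }
    return \@gamma;   # most probable state at t is argmax_i gamma[t][i]
}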
> I'd also prefer lower case names (BaumWelch could just be called
> "train" or "learn_unsupervised" or somesuch)
I have two ways to train the HMM: one without the hidden state sequence
supplied (i.e. BaumWelchTraining) and one with the hidden state sequence
(i.e. StatisticalTraining). Is the former learn_unsupervised and the
latter learn_supervised in AI speak?
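If that mapping is right, the two entry points might read like this
(method and variable names are tentative):

# Unsupervised: Baum-Welch over the observed sequences only.
$hmm->learn_unsupervised(\@observed_seqs);

# Supervised: parameters estimated by counting, given the state labels.
$hmm->learn_supervised(\@observed_seqs, \@state_seqs);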
Regards,
Yee Man
>
> Also, see the HMM functions available in Matlab that do the same ...
>
> Good luck,
>
> -Aaron
>
> --
> Aaron J. Mackey, Ph.D.
> Dept. of Biology, Goddard 212
> University of Pennsylvania email: amackey at pcbi.upenn.edu
> 415 S. University Avenue office: 215-898-1205
> Philadelphia, PA 19104-6017 fax: 215-746-6697
>