[Biojava-l] Generalized HMM in biojava?

Matthew Pocock matthew.pocock at ncl.ac.uk
Thu Jan 12 08:27:10 EST 2006


On Wednesday 11 January 2006 16:03, wendy wong wrote:
> Thanks!
> Now I have two questions about the SimpleEmissionState class:
>
> 1. advance: I am not entirely sure what it does. So if my state emits
> 4 symbols at a time do I set it to {4}?

If you are emitting 4 symbols at a time, then you should probably think of the 
sequence as being a string of 4-tuples. In this case, the advance would be 
{1}, as you emit a single 4-tuple each time.
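
To make the re-encoding concrete, here is a minimal plain-Java sketch of viewing a symbol string as a string of non-overlapping 4-tuples. It does not use the BioJava API; the class and method names (TupleChunker, toTuples) are made up for illustration. Each element of the resulting list is what a state would consume in one step, which is why the advance is 1 and not 4.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not BioJava. Names are hypothetical.
class TupleChunker {
    // Split a symbol string into consecutive, non-overlapping tuples of the given width.
    static List<String> toTuples(String symbols, int width) {
        if (width <= 0 || symbols.length() % width != 0) {
            throw new IllegalArgumentException(
                "sequence length must be a multiple of a positive tuple width");
        }
        List<String> tuples = new ArrayList<>();
        for (int i = 0; i < symbols.length(); i += width) {
            tuples.add(symbols.substring(i, i + width));
        }
        return tuples;
    }

    public static void main(String[] args) {
        // 12 single symbols become 3 tuple-symbols; the HMM then advances
        // one tuple-symbol per emission.
        List<String> tuples = toTuples("acgtttgaccaa", 4);
        System.out.println(tuples.size() + " tuples: " + tuples);
    }
}
```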

>
> 2. Each of my sites can emit up to more than 100 alphabets 

I think we are using different words here. Do you mean 100 alphabets, or 
alphabets containing 100 symbols?

> and if 
> each state emits 4 symbols at a time the number of alphabet for each
> state is 100^4. I am a bit concerned about setting up the
> distributions (too much memory consumption?).

Well, there's no way around this. If you really want to estimate a full 
discrete distribution over 4-tuples over 100 symbols, then you will have 
100^4 parameters to estimate.
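
For a sense of scale, the arithmetic can be checked with a few lines of plain Java. The memory figure assumes one 8-byte double per parameter in a dense table, which is an assumption of this sketch and not a statement about how BioJava stores distributions. The names ParameterCount, fullTableSize and denseBytes are hypothetical.

```java
// Illustration only: size of a full discrete distribution over tuples.
class ParameterCount {
    // alphabetSize^tupleLength, with overflow checking.
    static long fullTableSize(long alphabetSize, int tupleLength) {
        long n = 1;
        for (int i = 0; i < tupleLength; i++) {
            n = Math.multiplyExact(n, alphabetSize);
        }
        return n;
    }

    // Bytes needed if every parameter is stored as one double (assumed layout).
    static long denseBytes(long entries) {
        return Math.multiplyExact(entries, (long) Double.BYTES);
    }

    public static void main(String[] args) {
        long entries = fullTableSize(100, 4);
        System.out.println(entries);             // prints 100000000
        System.out.println(denseBytes(entries)); // prints 800000000
    }
}
```

That is 10^8 parameters and roughly 800 MB per emission state under the dense-double assumption, before any training data is considered.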

The alternative is to estimate a much smaller number of variables which, when 
combined together (e.g. by multiplying them), give the full set of 
parameters. With a little thinking, you can rig the distribution trainer to 
route the counts back from the 100^4 possible outcomes to the underlying 
parameters.
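
One possible shape for this, sketched in plain Java without the BioJava API: assume (purely for illustration) that the 4 positions of a tuple are independent, so the tuple probability is the product of 4 per-position probabilities. A count observed for a whole tuple is then routed back to one cell per position, and only 4 * 100 = 400 parameters are stored. The independence assumption and the names (FactorizedTupleDistribution, addCount, probability) are not from the original discussion; the right factorisation depends on what is being modelled.

```java
// Illustration only: a tuple distribution built from per-position parameters.
class FactorizedTupleDistribution {
    private final double[][] counts; // counts[position][symbolIndex]
    private final double[] totals;   // sum of counts at each position

    FactorizedTupleDistribution(int tupleLength, int alphabetSize, double pseudocount) {
        if (tupleLength <= 0 || alphabetSize <= 0 || pseudocount <= 0.0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        counts = new double[tupleLength][alphabetSize];
        totals = new double[tupleLength];
        for (int pos = 0; pos < tupleLength; pos++) {
            java.util.Arrays.fill(counts[pos], pseudocount);
            totals[pos] = pseudocount * alphabetSize;
        }
    }

    // Route one (possibly fractional) tuple count back to the underlying parameters.
    void addCount(int[] tuple, double weight) {
        for (int pos = 0; pos < counts.length; pos++) {
            counts[pos][tuple[pos]] += weight;
            totals[pos] += weight;
        }
    }

    // Tuple probability computed on demand as a product of per-position estimates.
    double probability(int[] tuple) {
        double p = 1.0;
        for (int pos = 0; pos < counts.length; pos++) {
            p *= counts[pos][tuple[pos]] / totals[pos];
        }
        return p;
    }

    // Number of stored parameters: tupleLength * alphabetSize, not alphabetSize^tupleLength.
    int parameterCount() {
        return counts.length * counts[0].length;
    }
}
```

The full table over tuples is never materialised; each entry is computed when it is asked for.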

It would probably help to have a better idea of what you are attempting to 
model.

> Is there a function that 
> I can overload so that the probability of each emission alphabet can
> be calculated on the run?

It's not the alphabet that will kill you, but the number of parameters you are 
estimating. Indeed, BioJava should be able to handle alphabets with more than 
2^32 symbols quite happily. There's an implementation of cross-product 
alphabet designed especially for this case.
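
As a rough illustration of why a huge tuple alphabet need not be enumerated, a tuple can be addressed by a single mixed-radix index that is computed on demand. This is a plain-Java sketch of the general idea only; it is not a description of how the BioJava cross-product alphabet is implemented, and the names (TupleIndex, encode, decode) are hypothetical.

```java
// Illustration only: address tuple-symbols by index without storing them.
class TupleIndex {
    // Map a tuple of symbol indices to a single index in [0, alphabetSize^length).
    static long encode(int[] tuple, int alphabetSize) {
        long index = 0;
        for (int s : tuple) {
            if (s < 0 || s >= alphabetSize) {
                throw new IllegalArgumentException("symbol index out of range: " + s);
            }
            index = Math.addExact(Math.multiplyExact(index, (long) alphabetSize), (long) s);
        }
        return index;
    }

    // Inverse of encode: recover the tuple of symbol indices from its index.
    static int[] decode(long index, int alphabetSize, int tupleLength) {
        int[] tuple = new int[tupleLength];
        for (int pos = tupleLength - 1; pos >= 0; pos--) {
            tuple[pos] = (int) (index % alphabetSize);
            index /= alphabetSize;
        }
        return tuple;
    }
}
```

Only the tuples that actually occur in the data ever need to be turned into objects, so the size of the alphabet itself costs nothing.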

>
> Thanks for your help!
>
> wendy
>
> On 1/11/06, Matthew Pocock <matthew.pocock at ncl.ac.uk> wrote:
> > If each state emits a fixed number of symbols then you can just do an HMM
> > where the emissions are over alpha^length. If you want the symbols to
> > overlap then use an order-n distribution.
> >
> > Matthew
> >
> > On Wednesday 11 January 2006 09:37, wendy wong wrote:
> > > what I mean by Generalized HMM is that each state emits a sequence of
> > > > symbols (fixed length though), which doesn't seem very
> > > > straightforward in biojava?
> > >
> > > thanks,
> > > wendy
> > >
> > > > On 1/11/06, mark.schreiber at novartis.com <mark.schreiber at novartis.com> wrote:
> > > > Depending on what you mean by generalized....
> > > >
> > > > > You can create lots of custom HMM architectures using the DP
> > > > > packages of biojava.
> > > >
> > > > - Mark
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > wendy wong <wendy.wong at gmail.com>
> > > > Sent by: biojava-l-bounces at portal.open-bio.org
> > > > 01/11/2006 05:00 AM
> > > > Please respond to sww8
> > > >
> > > >
> > > >         To:     biojava-l at biojava.org
> > > >         cc:     (bcc: Mark Schreiber/GP/Novartis)
> > > >         Subject:        [Biojava-l] Generalized HMM in biojava?
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I was wondering if it is possible to use the biojava library to
> > > > construct a generalized HMM?
> > > >
> > > > thanks,
> > > > Wendy
> > > >
> > > > _______________________________________________
> > > > Biojava-l mailing list  -  Biojava-l at biojava.org
> > > > http://biojava.org/mailman/listinfo/biojava-l
> > >

