[Dynamite] SingleModel with metasequences

Ian Holmes ihh@fruitfly.org
Wed, 8 Mar 2000 01:28:35 -0800 (PST)


On Wed, 8 Mar 2000, Ewan Birney wrote:

> > Here's my latest SingleModel. I'm quite pleased with it. It replaces
> > Ewan's ComplexSequence with the equally badly-named Metasequence (better
> > name would be welcomed!).
> 
> <grin>. Metasequence isn't so bad. But neither is complexsequence.

;) I still reckon diehard design patterns types would burn us both at the
same stake.

> > Also, I moved the emission scores so they're now associated with states,
> > not transitions. I think this is what we're all used to...
> 
> Not me! I *always* think of emissions on the transitions. Please - emit
> on transitions. Otherwise we get burnt in
> 	- genewise (transitions into MATCH have different emissions)
> 	- HMMER (J state)
> 	- everything else I do ;)

Are you sure about HMMER?

It's not a problem to do emissions on the transitions -- I didn't realise
you wanted it that way. It does mean we will have to come up with some
funky ModelParameters adapter classes for certain situations, but we were
probably going to need those anyway.

> >   typedef Score::NatScoreVector  Metasequence;   // a log odds-ratio for each residue of a sequence (e.g. splice site prediction scores)
> >   typedef sequence<Metasequence> MetasequenceVector;   // multiple Metasequences for multiple prediction categories
> 
> One of the issues here is that it is very common to have a metasequence
> built which the actual state machine only uses a few metasequences (for
> example, only uses the 5' splice sites). I felt a drawback to my
> complexsequence is that, like above, the Metasequences were just an array
> and you had to know that you were looking at "2" in the Metasequence
> array.
> 
> I wonder is somehow associating a name with the array is not a bad thing.
> Hmmmmmmmmmmm.

I originally tried this but rejected it for simplicity.

No problem to add it back in... you just need to replace the
MetasequenceVector with some kind of MetasequenceMap i.e.
map<X,Metasequence> to use STL notation

first decide whether you want X to be a string or a wrapper class like
State below

then roll your own map<X,Metasequence> interface with get & set methods

finally replace IntVector with sequence<X>

> >   // state & transition interfaces
> > 
> >   typedef sequence<int> IntVector;
> > 
> >   interface State {    // concrete
> > 
> >     readonly attribute string    name;                  // non-unique state ID
> >     readonly attribute boolean   emitting;              // false if this is a null state
> >     readonly attribute IntVector metasequence_indices;  // list of the Metasequence scores incurred on entering this state
> > 
> >     boolean equals (State s);  // test for equivalence (this should NOT be based on the name, but probably some internal unique tag)
> >     State   clone();           // copy constructor
> 
> Why do we need clone()?

I guess we don't. I put it in before I fully understood about IDL
distinctions between "by value" (structs only) & "by reference"
(interfaces only).

> >   };
> >   
> >   struct Transition {
> >     State from;
> >     State to;
> >   };
> 
> I want emissions in here.

The "boolean emitting" can be moved here here from the State interface but
I think the actual emission probabilities belong in the ModelParameters
bit.

> 
> Also don't we need emission length or lookback length, or something?
> 
> What about emitting triplets?
> 

Cool! Care to write the IDL? ;-)

You might want to replace IntVector (the metasequence indices vector) with
sequence<IntVector> too, one for each lookback position.

> > 
> >   typedef sequence<State>      StateVector;
> >   typedef sequence<Transition> TransitionVector;
> > 
> >   // now the model interfaces
> > 
> >   interface Model {     // abstract
> > 
> >     string           name();
> >     StateVector      states();
> >     TransitionVector transitions();
> >     State            begin();           // every model is born with a begin & an end state
> >     State            end();
> >     int              metasequences();   // number of Metasequences that this model requires
> > 
> >     // perhaps we should add a non-virtual factory method for a vanilla WriteableModelParameters here
> >     // (although I hate that word "vanilla"... vanilla is my favourite flavour and now it's a synonym for "default")
> >   };
> > 
> >   interface WriteableModel : Model {    // concrete
> > 
> >     State new_state (in string name,
> > 		     in boolean emitting,
> > 		     in IntVector metasequence_indices);    // allocates a new State but doesn't add it to the model
> > 
> >     void  add_state (in State s);        // adds a State to the model
> >     void  remove_state (in State s);     // guess what this does
> > 
> >     // remove_state should raise an exception if user tries to remove the begin or end states
> >     
> >     // don't need a factory method for Transitions since they're just struct's. but we do need add & remove methods:
> > 
> >     void add_transition (in Transition t);
> >     void remove_transition (in Transition t);
> > 
> >     void clear();               // clears all states & transitions
> >     void clear_transitions();   // clears transitions but leaves states
> > 
> >     void set_metasequences (int n);   // set the number of metasequences
> > 
> >     // The set_metasequences method should raise an exception unless the model is empty (i.e. no states)
> >   };
> > 
> >   // parameter stuff
> > 
> >   interface ModelParameters {      // abstract
> > 
> >     readonly attribute Score::Scheme scoring_scheme;
> > 
> >     Score::NatScore       get_log_transition_probability (in Transition t);
> >     Score::NatScoreVector get_log_emission_probabilities (in State s);
> > 
> >     // ihh: I no longer think the parallel array stuff is a good idea for the top-level interface.
> >     // These get_log_.*_probability() methods are not something we want to be calling inside a DP routine,
> >     // but they're fine for the top level. We can use parallel arrays (or whatever) lower down, if needs be.
> >     //
> >     // Actually -- let's stick our necks out and boast that any call to get_log_.*_probability()
> >     // will only take O(log M) time, where M is the number of states in the model.
> >   };
> > 
> >   interface WriteableModelParameters : ModelParameters {    // concrete
> > 
> >     void set_log_transition_probability (in Transition t, in State::NatScore p);
> >     void set_log_emission_probabilities (in State s, in State::NatScoreVector p_vec);
> > 
> >     // I think it is useful to separate out the WriteableModelParameters interface
> >     // from the ModelParameters interface, because we might want to have a subclass of
> >     // ModelParameters that was a wrapper to a smaller parameter space (e.g. SmithWaterman) -ihh
> >   };
> > 
> >   // counts stuff (used for training)
> >   
> >   interface ModelCounts {   // concrete
> > 
> >     readonly attribute Score::Scheme scoring_scheme;
> > 
> >     Score::NatScore       get_log_transition_count (in Transition t);
> >     Score::NatScoreVector get_log_emission_counts (in State s);
> > 
> >     void zero_counts (in Model m);     // sets all log-counts to -infinity
> > 
> >     void increase_log_transition_count (in Transition t, in Score::NatScore log_increment);
> >     void increase_log_emission_counts (in State s, in Score::NatScoreVector log_increment_vec);
> > 
> >     // I don't think there's any advantage in having a separate WriteableModelCounts interface -ihh
> >   };
> > 
> >   // null model
> >   
> >   struct NullParameters {
> >     Score::NatScoreVector emit;     // vector of log-likelihoods of each residue according to null model
> >     Score::NatScore       extend;   // log-likelihood of single-residue sequence extension according to null model
> >     Score::NatScore       end;      // log-likelihood of sequence termination according to null model
> >   };
> > 
> >   // alignments
> >   
> >   struct Traceback {
> >     int             query_start;     // sequence start index for alignment
> >     StateVector     path;
> >     Score::NatScore log_likelihood_ratio;
> >   };
> > 
> 
> Do we want the traceback to have scores per state jump?

I think it'd be better if there were a way to ask the Model this.

> 
> We are implicitly banning two different transitions of the same offset
> between the same states. This is what I do anyway, I just want to point it
> out ;)
> 

:) cool. (We could remove this limitation if we made Transition an
interface like State, and gave it an equals() method and a factory method
in Model (like State), and replaced "StateVector path" with
"TransitionVector path"... but I am not that bothered -- Occam's razor
applied to design here, I think)

> 
> >   // alignment tether (specifies global, local, semi-local etc)
> > 
> >   struct AlignmentTether {
> > 
> >     boolean left_tethered;      // if these are both TRUE then it's a global alignment
> >     boolean right_tethered;     // if they're both FALSE then it's local
> > 
> >     // Local alignment implies that the log-likelihood will have an infinite component, because if the cost
> >     // of extending the "flanking" region (i.e. the unaligned part) is zero then the penalty for leaving
> >     // it has to be infinite for the corresponding global model to stay normalised.
> >     // Since computer languages with infinite quantities as basic types are lamentably rare,
> >     // we cancel this by ensuring that the null model has exactly as many infinitely-penalised
> >     // transitions as the test model.
> >     // See http://www.sanger.ac.uk/Users/ihh/thesis.ps.gz pp42-44 for a worked example.
> >   };
> > 
> >   // now the algorithm interfaces
> >   
> >   interface ViterbiAlgorithm {        // abstract
> > 
> >     // the idea is that concrete implementations (vanilla, linear-space etc) will inherit from this class
> > 
> >     Traceback do_Viterbi (in Model model,
> > 			  in ModelParameters parameters,
> > 			  in NullParameters null,
> > 			  in Sequence::LightSeq query,
> > 			  in MetasequenceVector metaquery,
> > 			  in AlignmentTether tether);
> >     
> >     // ihh: I'm starting to understand more & more of the rationale behind IDL & CORBA --
> >     // it makes sense to use a LightSeq interface rather than a LightSeqMomento struct here,
> >     // because interfaces are passed by reference and so are "lighter" (i.e. lower-bandwidth)
> >     // in a network context. c00l!
> 
> Less of the phREAker000 stuff...
> 

1t'5 1n my  |3 |_ 0 0 |>,  |> () () |>


> >   };
> > 
> >   interface ForwardBackwardAlgorithm {      // abstract
> >     
> >     ModelCounts do_ForwardBackward (in Model model,
> > 				    in ModelParameters parameters,
> > 				    in NullParameters null,
> > 				    in Sequence::LightSeq query,
> > 				    in MetasequenceVector metaquery,
> > 				    in AlignmentTether tether);
> >   };
> > 
> >   // maybe we want a ForwardAlgorithm interface too -- this should just return a forward score (no traceback)
> >   // There should also be a variant of ForwardAlgorithm that returns a ForwardMatrix object
> >   // -- which the user can then obtain sample tracebacks from. Perhaps Viterbi should do this too.
> > 
> >   // basic training interface
> >   
> >   interface ParameterUpdateAlgorithm {    // abstract
> >     void update_parameters (in ModelCounts counts, out WriteableModelParameters parameters);
> >   };
> > 
> >  }
> > 
> > 
> > _______________________________________________
> > Dynamite mailing list  -  Dynamite@bioperl.org
> > http://www.bioperl.org/mailman/listinfo/dynamite
> > 
> 
>