[Dynamite] SingleModel with metasequences
Ewan Birney
birney@ebi.ac.uk
Wed, 8 Mar 2000 16:55:45 +0000 (GMT)
On Tue, 7 Mar 2000, Ian Holmes wrote:
> Here's my latest SingleModel. I'm quite pleased with it. It replaces
> Ewan's ComplexSequence with the equally badly-named Metasequence (better
> name would be welcomed!).
<grin>. Metasequence isn't so bad. But neither is complexsequence.
>
> A metasequence is an array of scores (e.g. splice site predictions)
> associated with each residue of a sequence.. assumed to be independent of
> the sequence itself (altho it's usually not).
>
> You can have more than one metasequence for a model (e.g. splice site
> predictions + hexamer coding potentials). Any state can use scores from
> any number of metasequences.
>
> Also, I moved the emission scores so they're now associated with states,
> not transitions. I think this is what we're all used to...
Not me! I *always* think of emissions on the transitions. Please - emit
on transitions. Otherwise we get burnt in
- genewise (transitions into MATCH have different emissions)
- HMMER (J state)
- everything else I do ;)
>
> Local/semilocal/global alignment is specified by means of
> AlignmentTether structures.
>
> I have labelled each interface "abstract" (meaning it doesn't need a
> factory method) or "concrete" (meaning it does).
>
> here goes (it's also on the wiki) -- please rip to shreds & tell me what
> you think:
>
> module SingleModel {
>
> // metasequences
>
> typedef Score::NatScoreVector Metasequence; // a log odds-ratio for each residue of a sequence (e.g. splice site prediction scores)
> typedef sequence<Metasequence> MetasequenceVector; // multiple Metasequences for multiple prediction categories
One of the issues here is that it is very common to have a metasequence
built which the actual state machine only uses a few metasequences (for
example, only uses the 5' splice sites). I felt a drawback to my
complexsequence is that, like above, the Metasequences were just an array
and you had to know that you were looking at "2" in the Metasequence
array.
I wonder is somehow associating a name with the array is not a bad thing.
Hmmmmmmmmmmm.
>
> // state & transition interfaces
>
> typedef sequence<int> IntVector;
>
> interface State { // concrete
>
> readonly attribute string name; // non-unique state ID
> readonly attribute boolean emitting; // false if this is a null state
> readonly attribute IntVector metasequence_indices; // list of the Metasequence scores incurred on entering this state
>
> boolean equals (State s); // test for equivalence (this should NOT be based on the name, but probably some internal unique tag)
> State clone(); // copy constructor
Why do we need clone()?
> };
>
> struct Transition {
> State from;
> State to;
> };
I want emissions in here.
Also don't we need emission length or lookback length, or something?
What about emitting triplets?
>
> typedef sequence<State> StateVector;
> typedef sequence<Transition> TransitionVector;
>
> // now the model interfaces
>
> interface Model { // abstract
>
> string name();
> StateVector states();
> TransitionVector transitions();
> State begin(); // every model is born with a begin & an end state
> State end();
> int metasequences(); // number of Metasequences that this model requires
>
> // perhaps we should add a non-virtual factory method for a vanilla WriteableModelParameters here
> // (although I hate that word "vanilla"... vanilla is my favourite flavour and now it's a synonym for "default")
> };
>
> interface WriteableModel : Model { // concrete
>
> State new_state (in string name,
> in boolean emitting,
> in IntVector metasequence_indices); // allocates a new State but doesn't add it to the model
>
> void add_state (in State s); // adds a State to the model
> void remove_state (in State s); // guess what this does
>
> // remove_state should raise an exception if user tries to remove the begin or end states
>
> // don't need a factory method for Transitions since they're just struct's. but we do need add & remove methods:
>
> void add_transition (in Transition t);
> void remove_transition (in Transition t);
>
> void clear(); // clears all states & transitions
> void clear_transitions(); // clears transitions but leaves states
>
> void set_metasequences (int n); // set the number of metasequences
>
> // The set_metasequences method should raise an exception unless the model is empty (i.e. no states)
> };
>
> // parameter stuff
>
> interface ModelParameters { // abstract
>
> readonly attribute Score::Scheme scoring_scheme;
>
> Score::NatScore get_log_transition_probability (in Transition t);
> Score::NatScoreVector get_log_emission_probabilities (in State s);
>
> // ihh: I no longer think the parallel array stuff is a good idea for the top-level interface.
> // These get_log_.*_probability() methods are not something we want to be calling inside a DP routine,
> // but they're fine for the top level. We can use parallel arrays (or whatever) lower down, if needs be.
> //
> // Actually -- let's stick our necks out and boast that any call to get_log_.*_probability()
> // will only take O(log M) time, where M is the number of states in the model.
> };
>
> interface WriteableModelParameters : ModelParameters { // concrete
>
> void set_log_transition_probability (in Transition t, in State::NatScore p);
> void set_log_emission_probabilities (in State s, in State::NatScoreVector p_vec);
>
> // I think it is useful to separate out the WriteableModelParameters interface
> // from the ModelParameters interface, because we might want to have a subclass of
> // ModelParameters that was a wrapper to a smaller parameter space (e.g. SmithWaterman) -ihh
> };
>
> // counts stuff (used for training)
>
> interface ModelCounts { // concrete
>
> readonly attribute Score::Scheme scoring_scheme;
>
> Score::NatScore get_log_transition_count (in Transition t);
> Score::NatScoreVector get_log_emission_counts (in State s);
>
> void zero_counts (in Model m); // sets all log-counts to -infinity
>
> void increase_log_transition_count (in Transition t, in Score::NatScore log_increment);
> void increase_log_emission_counts (in State s, in Score::NatScoreVector log_increment_vec);
>
> // I don't think there's any advantage in having a separate WriteableModelCounts interface -ihh
> };
>
> // null model
>
> struct NullParameters {
> Score::NatScoreVector emit; // vector of log-likelihoods of each residue according to null model
> Score::NatScore extend; // log-likelihood of single-residue sequence extension according to null model
> Score::NatScore end; // log-likelihood of sequence termination according to null model
> };
>
> // alignments
>
> struct Traceback {
> int query_start; // sequence start index for alignment
> StateVector path;
> Score::NatScore log_likelihood_ratio;
> };
>
Do we want the traceback to have scores per state jump?
We are implicitly banning two different transitions of the same offset
between the same states. This is what I do anyway, I just want to point it
out ;)
> // alignment tether (specifies global, local, semi-local etc)
>
> struct AlignmentTether {
>
> boolean left_tethered; // if these are both TRUE then it's a global alignment
> boolean right_tethered; // if they're both FALSE then it's local
>
> // Local alignment implies that the log-likelihood will have an infinite component, because if the cost
> // of extending the "flanking" region (i.e. the unaligned part) is zero then the penalty for leaving
> // it has to be infinite for the corresponding global model to stay normalised.
> // Since computer languages with infinite quantities as basic types are lamentably rare,
> // we cancel this by ensuring that the null model has exactly as many infinitely-penalised
> // transitions as the test model.
> // See http://www.sanger.ac.uk/Users/ihh/thesis.ps.gz pp42-44 for a worked example.
> };
>
> // now the algorithm interfaces
>
> interface ViterbiAlgorithm { // abstract
>
> // the idea is that concrete implementations (vanilla, linear-space etc) will inherit from this class
>
> Traceback do_Viterbi (in Model model,
> in ModelParameters parameters,
> in NullParameters null,
> in Sequence::LightSeq query,
> in MetasequenceVector metaquery,
> in AlignmentTether tether);
>
> // ihh: I'm starting to understand more & more of the rationale behind IDL & CORBA --
> // it makes sense to use a LightSeq interface rather than a LightSeqMomento struct here,
> // because interfaces are passed by reference and so are "lighter" (i.e. lower-bandwidth)
> // in a network context. c00l!
Less of the phREAker000 stuff...
> };
>
> interface ForwardBackwardAlgorithm { // abstract
>
> ModelCounts do_ForwardBackward (in Model model,
> in ModelParameters parameters,
> in NullParameters null,
> in Sequence::LightSeq query,
> in MetasequenceVector metaquery,
> in AlignmentTether tether);
> };
>
> // maybe we want a ForwardAlgorithm interface too -- this should just return a forward score (no traceback)
> // There should also be a variant of ForwardAlgorithm that returns a ForwardMatrix object
> // -- which the user can then obtain sample tracebacks from. Perhaps Viterbi should do this too.
>
> // basic training interface
>
> interface ParameterUpdateAlgorithm { // abstract
> void update_parameters (in ModelCounts counts, out WriteableModelParameters parameters);
> };
>
> }
>
>
> _______________________________________________
> Dynamite mailing list - Dynamite@bioperl.org
> http://www.bioperl.org/mailman/listinfo/dynamite
>