[Dynamite] SingleModel with metasequences
Ian Holmes
ihh@fruitfly.org
Tue, 7 Mar 2000 21:49:19 -0800 (PST)
Here's my latest SingleModel. I'm quite pleased with it. It replaces
Ewan's ComplexSequence with the equally badly-named Metasequence (better
name would be welcomed!).
A metasequence is an array of scores (e.g. splice site predictions)
associated with each residue of a sequence.. assumed to be independent of
the sequence itself (altho it's usually not).
You can have more than one metasequence for a model (e.g. splice site
predictions + hexamer coding potentials). Any state can use scores from
any number of metasequences.
Also, I moved the emission scores so they're now associated with states,
not transitions. I think this is what we're all used to...
Local/semilocal/global alignment is specified by means of
AlignmentTether structures.
I have labelled each interface "abstract" (meaning it doesn't need a
factory method) or "concrete" (meaning it does).
here goes (it's also on the wiki) -- please rip to shreds & tell me what
you think:
module SingleModel {
// metasequences
typedef Score::NatScoreVector Metasequence; // a log odds-ratio for each residue of a sequence (e.g. splice site prediction scores)
typedef sequence<Metasequence> MetasequenceVector; // multiple Metasequences for multiple prediction categories
// state & transition interfaces
typedef sequence<int> IntVector;
interface State { // concrete
readonly attribute string name; // non-unique state ID
readonly attribute boolean emitting; // false if this is a null state
readonly attribute IntVector metasequence_indices; // list of the Metasequence scores incurred on entering this state
boolean equals (State s); // test for equivalence (this should NOT be based on the name, but probably some internal unique tag)
State clone(); // copy constructor
};
struct Transition {
State from;
State to;
};
typedef sequence<State> StateVector;
typedef sequence<Transition> TransitionVector;
// now the model interfaces
interface Model { // abstract
string name();
StateVector states();
TransitionVector transitions();
State begin(); // every model is born with a begin & an end state
State end();
int metasequences(); // number of Metasequences that this model requires
// perhaps we should add a non-virtual factory method for a vanilla WriteableModelParameters here
// (although I hate that word "vanilla"... vanilla is my favourite flavour and now it's a synonym for "default")
};
interface WriteableModel : Model { // concrete
State new_state (in string name,
in boolean emitting,
in IntVector metasequence_indices); // allocates a new State but doesn't add it to the model
void add_state (in State s); // adds a State to the model
void remove_state (in State s); // guess what this does
// remove_state should raise an exception if user tries to remove the begin or end states
// don't need a factory method for Transitions since they're just struct's. but we do need add & remove methods:
void add_transition (in Transition t);
void remove_transition (in Transition t);
void clear(); // clears all states & transitions
void clear_transitions(); // clears transitions but leaves states
void set_metasequences (int n); // set the number of metasequences
// The set_metasequences method should raise an exception unless the model is empty (i.e. no states)
};
// parameter stuff
interface ModelParameters { // abstract
readonly attribute Score::Scheme scoring_scheme;
Score::NatScore get_log_transition_probability (in Transition t);
Score::NatScoreVector get_log_emission_probabilities (in State s);
// ihh: I no longer think the parallel array stuff is a good idea for the top-level interface.
// These get_log_.*_probability() methods are not something we want to be calling inside a DP routine,
// but they're fine for the top level. We can use parallel arrays (or whatever) lower down, if needs be.
//
// Actually -- let's stick our necks out and boast that any call to get_log_.*_probability()
// will only take O(log M) time, where M is the number of states in the model.
};
interface WriteableModelParameters : ModelParameters { // concrete
void set_log_transition_probability (in Transition t, in State::NatScore p);
void set_log_emission_probabilities (in State s, in State::NatScoreVector p_vec);
// I think it is useful to separate out the WriteableModelParameters interface
// from the ModelParameters interface, because we might want to have a subclass of
// ModelParameters that was a wrapper to a smaller parameter space (e.g. SmithWaterman) -ihh
};
// counts stuff (used for training)
interface ModelCounts { // concrete
readonly attribute Score::Scheme scoring_scheme;
Score::NatScore get_log_transition_count (in Transition t);
Score::NatScoreVector get_log_emission_counts (in State s);
void zero_counts (in Model m); // sets all log-counts to -infinity
void increase_log_transition_count (in Transition t, in Score::NatScore log_increment);
void increase_log_emission_counts (in State s, in Score::NatScoreVector log_increment_vec);
// I don't think there's any advantage in having a separate WriteableModelCounts interface -ihh
};
// null model
struct NullParameters {
Score::NatScoreVector emit; // vector of log-likelihoods of each residue according to null model
Score::NatScore extend; // log-likelihood of single-residue sequence extension according to null model
Score::NatScore end; // log-likelihood of sequence termination according to null model
};
// alignments
struct Traceback {
int query_start; // sequence start index for alignment
StateVector path;
Score::NatScore log_likelihood_ratio;
};
// alignment tether (specifies global, local, semi-local etc)
struct AlignmentTether {
boolean left_tethered; // if these are both TRUE then it's a global alignment
boolean right_tethered; // if they're both FALSE then it's local
// Local alignment implies that the log-likelihood will have an infinite component, because if the cost
// of extending the "flanking" region (i.e. the unaligned part) is zero then the penalty for leaving
// it has to be infinite for the corresponding global model to stay normalised.
// Since computer languages with infinite quantities as basic types are lamentably rare,
// we cancel this by ensuring that the null model has exactly as many infinitely-penalised
// transitions as the test model.
// See http://www.sanger.ac.uk/Users/ihh/thesis.ps.gz pp42-44 for a worked example.
};
// now the algorithm interfaces
interface ViterbiAlgorithm { // abstract
// the idea is that concrete implementations (vanilla, linear-space etc) will inherit from this class
Traceback do_Viterbi (in Model model,
in ModelParameters parameters,
in NullParameters null,
in Sequence::LightSeq query,
in MetasequenceVector metaquery,
in AlignmentTether tether);
// ihh: I'm starting to understand more & more of the rationale behind IDL & CORBA --
// it makes sense to use a LightSeq interface rather than a LightSeqMomento struct here,
// because interfaces are passed by reference and so are "lighter" (i.e. lower-bandwidth)
// in a network context. c00l!
};
interface ForwardBackwardAlgorithm { // abstract
ModelCounts do_ForwardBackward (in Model model,
in ModelParameters parameters,
in NullParameters null,
in Sequence::LightSeq query,
in MetasequenceVector metaquery,
in AlignmentTether tether);
};
// maybe we want a ForwardAlgorithm interface too -- this should just return a forward score (no traceback)
// There should also be a variant of ForwardAlgorithm that returns a ForwardMatrix object
// -- which the user can then obtain sample tracebacks from. Perhaps Viterbi should do this too.
// basic training interface
interface ParameterUpdateAlgorithm { // abstract
void update_parameters (in ModelCounts counts, out WriteableModelParameters parameters);
};
}