[Bioperl-l] Bio::Ontology
Chris Mungall
cjm@fruitfly.org
Thu, 19 Sep 2002 08:11:07 -0700 (PDT)
Ok, I have some controlled vocabulary and graph code ready to check in.
I was going to check it in initially as another branch (mainly
because I don't want to deal with namespace changes and cvs, which could
get ugly), but there seems to be a lock preventing me:
cvs server: failed to obtain dir lock in repository
`/home/repository/bioperl/bioperl-live/Bio/Tools/Run/Phylo'
anyway, for now I have stuck the code in
fruitfly.org/~cjm/bioperl-live.tar.gz
if everyone is happy with the namespaces i'll go ahead and commit it onto
the main branch
I'm including a pseudo interface spec below
So far there are implementations for the basic graph vocab stuff, but not
for associations between entities (eg gene products, markers) and CV
terms. I'd like to generate a bit of discussion on these first. Ewan,
Jason and I discussed this briefly - what should the root interface for
the associated entities be? Jason and Ewan had some reservations about the
interface (AnnotatableI) below. It does have the advantage that tools like
AmiGO could be neutral wrt whether they were associating genes to GO,
markers to a phenotypic/trait ontology, images/experiments to an anatomy
ontology.
There are also parsers; these use an event based model, but have
Bio::OntologyIO wrappers. I'd say it's ready to use for the purpose of
feature types and a sequence ontology.
! OK, here's the proposed spec for the bioperl ontologies component
! (maybe this could be the interface definitions for the other bio*
! projects too?)
! it's kind of interface heavy, but I think this is necessary if we want
! to keep this generic, flexible, etc
!
! these are the use cases assumed:
!
! flat lists of controlled biological terms, where each biological term
! can have various tracking info added (synonyms, dbxrefs, etc)
!
! structured controlled vocabularies (aka loose semantic networks),
! including the following:
!
! trees of bioterms, which represent subtype/supertype relationships
!
! DAGs of bioterms, where all arcs in the graph represent
! subtype/supertype relationships
!
! DAGs of bioterms, where the arcs can represent different relationship
! types, a la Gene Ontology - typical relationship types would be
! ISA, PARTOF. However, relationship types are restricted to those for
! which the true-path rule holds true (see http://www.geneontology.org/...).
! currently this is just ISA, and PARTOF when PARTOF is used in the sense
! of "necessarily part of". for instance, a 'door' is part of a 'car', but
! it isn't necessarily part of a car - it could be part of a house. we could
! introduce PARTOF(car_door, car) which is always true.
! the true path rule is useful for ontology consistency, and for answering
! recursive queries; for example, to answer 'find all genes that are
! transmembrane receptors' we would find genes associated with TM receptors
! AND all children of the TM receptor node.
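The recursive query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term names, the gene names and the helpers `descendants` and `genes_for` are all made up for the example and are not part of the proposed interfaces.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: child term -> parent terms (true-path arcs only).
PARENTS = {
    "GPCR": ["TM receptor"],
    "rhodopsin-like receptor": ["GPCR"],
    "receptor tyrosine kinase": ["TM receptor"],
    "TM receptor": ["receptor"],
}

# Hypothetical direct gene -> term associations.
DIRECT = {
    "geneA": ["rhodopsin-like receptor"],
    "geneB": ["receptor tyrosine kinase"],
    "geneC": ["receptor"],
}


def descendants(term):
    """Return every term reachable by walking down from `term`."""
    children = defaultdict(list)
    for child, parents in PARENTS.items():
        for parent in parents:
            children[parent].append(child)
    seen, stack = set(), [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_for(term):
    """Genes associated with `term` itself or with any of its descendants."""
    wanted = descendants(term) | {term}
    return sorted(g for g, terms in DIRECT.items() if wanted & set(terms))


print(genes_for("TM receptor"))
```

Here geneC is attached only to the more general term 'receptor', so it is not returned for 'TM receptor'; the rule only propagates associations upwards.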
!
! Graphs of bioterms, in which arcs in the graph are not necessarily
! transitive, and in which cycles may be allowed. the true path rule may not
! necessarily hold. examples include vocabs that include temporal arcs
! (which may cause cycles, eg birth, cell cycles); the true path rule does
! not hold with temporal arcs (eg if a gene is expressed at term:G_phase it
! is not necessarily expressed at term:M_phase; if a gene has a phenotype of
! the right_leg, it IS correct to say that it is a limb phenotype, but NOT
! correct to say that it is an embryonic phenotype, even though right_leg is
! a recursive child of embryo via DEVELOPS-FROM arcs/relationships)
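As a rough Python sketch of the right_leg example: an upward traversal that follows every arc reaches embryo, while one restricted to arcs obeying the true path rule does not. The arc list, the intermediate terms 'leg' and 'leg_bud', and the `ancestors` helper are assumptions made for illustration only.

```python
# Hypothetical arcs as (child, relationship_type, parent).
ARCS = [
    ("right_leg", "ISA", "leg"),
    ("leg", "ISA", "limb"),
    ("right_leg", "DEVELOPS_FROM", "leg_bud"),
    ("leg_bud", "PARTOF", "embryo"),
]

# Relationship types assumed to obey the true path rule.
COVERING = {"ISA", "PARTOF"}


def ancestors(term, allowed=None):
    """Walk up from `term`; if `allowed` is given, follow only those arc types.

    The `seen` set also protects the walk against cycles.
    """
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        for child, rel, parent in ARCS:
            if child != node or parent in seen:
                continue
            if allowed is None or rel in allowed:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("right_leg")))            # every arc type
print(sorted(ancestors("right_leg", COVERING)))  # true-path arcs only
```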
!
! In addition to the different kinds of vocabs/ontologies above, there is
! support for making associations between entities and vocab terms; these
! entities could be gene products (ie proteins or RNA products), genes,
! markers/alleles (eg in phenotypic/trait ontologies), sequences;
! these could be represented by Bio::SeqI, Bio::SeqFeatureI or
! Bio::Map::MarkerI objects.
!
! associations are entities in their own right, as they are assertions
! made at some particular time by a person or a computational analysis, and
! should be tracked with references. see http://... for a list of GO
! evidence criteria (of course this object model is not restricted to GO
! criteria)
!
! NOTE ON SEQUENCE FEATURE ONTOLOGIES
! for associations between a SeqFeatureI entity and a feature type (eg in SO)
! (currently done with the $sf->primary_tag() method) there will probably
! be no association/evidence; we should probably add $sf->feature_type
! which returns a VocabTermI object, and make $sf->primary_tag delegate to
! $sf->feature_type()->label()
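To make the delegation idea concrete, here is a small language-neutral sketch in Python (not bioperl code). The class names and the term identifier "EX:0001" are hypothetical; the only point is that primary_tag is derived from the feature type's label rather than stored separately.

```python
class VocabTerm:
    """Minimal stand-in for a VocabTermI: an identifier plus a label."""

    def __init__(self, identifier, label):
        self.identifier = identifier
        self.label = label


class SeqFeature:
    """Minimal stand-in for a SeqFeatureI that holds a typed feature_type."""

    def __init__(self, feature_type):
        self.feature_type = feature_type

    @property
    def primary_tag(self):
        # Delegate: the tag is whatever the vocabulary term is labelled.
        return self.feature_type.label


exon = VocabTerm("EX:0001", "exon")  # hypothetical identifier
sf = SeqFeature(exon)
print(sf.primary_tag)
```

Existing callers of primary_tag keep working, while new code can ask for the full term object.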
!
! there is no *explicit* support for frames-style/description logic
! ontologies (you really want to be using more specialised tools and not
! bioperl for this anyway); however, provision has been made for layering
! these on in a compatible way. at this time, the most prevalent ontologies
! within biology (or at least the most prevalent use cases within bioperl)
! are structured controlled vocabularies, GO style (although these can
! easily be represented as a full frame style or DL ontology, provided a few
! constraints are followed, this ramping up of expressive power would force
! unnecessary complexity in this object model).
!
! one can imagine different implementations of these interfaces;
! eg memory based vs secondary storage based (while most vocabs fit
! into memory, vocabs PLUS the entities associated with the terms
! generally do not).
! different implementations may provide different semantics of some of the
! operations below. for instance, a simple graph implementation would
! traverse down the graph to implement get_all_children(); a graph/vocab
! with added semantics may choose to only traverse recursive relationships.
!
! TODO: further split this into components
! (triples, graphs, vocabs, associations/annotations)
namespace Bio::Graph
enum TraversalMethod { BREADTH_FIRST, DEPTH_FIRST };
enum TraversalDirection { DOWN, UP };
typedef string TripleElement
typedef string Identifier
typedef string TimeStamp
typedef Bio::Annotation::DBLink DBLink
interface TripleI extends Bio::Root::RootI
attribute TripleElement subject
attribute TripleElement predicate
attribute TripleElement object
interface TripleStoreI extends Bio::Root::RootI
add(TripleI triple): # adds new triple to store
get(TripleI triple): TripleI[] # fetches matching triples
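One possible reading of get() is query-by-example: the triple passed in acts as a pattern. The Python sketch below assumes that an unset element (None) is a wildcard; that convention is my assumption and is not stated in the spec.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: Optional[str]
    predicate: Optional[str]
    object: Optional[str]


class TripleStore:
    """In-memory store; None in a query pattern is treated as a wildcard."""

    def __init__(self):
        self._triples = []

    def add(self, triple):
        # Ignore exact duplicates so the store behaves like a set.
        if triple not in self._triples:
            self._triples.append(triple)

    def get(self, pattern):
        return [
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        ]


store = TripleStore()
store.add(Triple("car_door", "PARTOF", "car"))
store.add(Triple("sports_car", "ISA", "car"))
print(store.get(Triple(None, None, "car")))
```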
interface NodeI extends Bio::Root::RootI
attribute string identifier
attribute ANY node_data
interface ArcI extends Bio::Root::RootI
attribute NodeI parent_node
attribute NodeI child_node
attribute NodeI arctype_node
arc_label(): string # description of relationship type
attribute ANY arc_data
interface PathI extends Bio::Root::RootI
attribute ArcI[] arcs # attribute accessor
reverse():
interface GraphIteratorI extends Bio::Root::RootI
reset_cursor():
attribute TraversalMethod traversal_method
attribute TraversalDirection traversal_direction
this_node(): NodeI
next():
next_node(): NodeI
path(): PathI # path to get here from initial node
depth(): int
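A minimal Python sketch of how such an iterator might behave, assuming a downward traversal over a simple adjacency mapping. For brevity path() returns a list of node identifiers rather than a PathI of arcs, and the constructor arguments are hypothetical.

```python
from collections import deque


class GraphIterator:
    """Walks down from a start node, breadth-first or depth-first."""

    def __init__(self, children, start, breadth_first=True):
        self._children = children      # node -> list of child nodes
        self._start = start
        self._breadth_first = breadth_first
        self.reset_cursor()

    def reset_cursor(self):
        self._pending = deque([(self._start, [self._start])])
        self._seen = set()
        self._path = []

    def next_node(self):
        """Return the next unvisited node, or None when exhausted."""
        while self._pending:
            if self._breadth_first:
                node, path = self._pending.popleft()
            else:
                node, path = self._pending.pop()
            if node in self._seen:     # already visited (DAG or cycle)
                continue
            self._seen.add(node)
            self._path = path
            kids = self._children.get(node, [])
            if not self._breadth_first:
                kids = reversed(kids)  # so the first child is visited first
            for child in kids:
                self._pending.append((child, path + [child]))
            return node
        return None

    def path(self):
        return list(self._path)

    def depth(self):
        # 0 for the start node; -1 before the first call to next_node().
        return len(self._path) - 1


def walk(iterator):
    """Collect nodes in visiting order."""
    order = []
    node = iterator.next_node()
    while node is not None:
        order.append(node)
        node = iterator.next_node()
    return order
```

Because visited nodes are skipped, the same loop terminates on graphs with cycles.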
# graphs are implemented on top of triple stores;
# this allows the API user to access the underlying binary
# predicates directly if desired.
# different implementations of GraphI may choose to implement the
# semantics of the methods below differently; an implementation may delegate
# directly to the underlying triple store, or it may apply some
# semantics (for instance, it may be desirable to only treat
# transitive predicates as parents/children, and effectively
# hide non-transitive predicates at the graph interface level)
interface GraphI extends TripleStoreI
add_arc(ArcI arc):
add_node(NodeI node):
get_node(Identifier identifier): NodeI
get_all_nodes(): NodeI[]
get_all_arcs(): ArcI[]
get_all_arctypes(): NodeI[]
get_child_nodes(NodeI node): NodeI[]
get_all_child_nodes(NodeI node): NodeI[]
get_parent_nodes(NodeI node): NodeI[]
get_all_parent_nodes(NodeI node): NodeI[]
get_graph_iterator(NodeI node, TraversalMethod traversal_method): GraphIteratorI
get_root_nodes(): NodeI[]
paths_to_root(): PathI[]
get_leaf_nodes(): NodeI[]
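The sketch below shows, in Python, how a few of the GraphI calls could be answered directly from the underlying triples. It assumes triples are stored as (child, predicate, parent), that the graph is acyclic, and that paths_to_root takes a start node; the spec leaves that last signature open, so treat it as my assumption.

```python
class Graph:
    """Graph view over (child, predicate, parent) triples; assumes a DAG."""

    def __init__(self, triples):
        self.triples = list(triples)

    def get_all_nodes(self):
        nodes = {s for s, _, _ in self.triples}
        nodes |= {o for _, _, o in self.triples}
        return sorted(nodes)

    def get_parent_nodes(self, node):
        return sorted({o for s, _, o in self.triples if s == node})

    def get_child_nodes(self, node):
        return sorted({s for s, _, o in self.triples if o == node})

    def get_root_nodes(self):
        return [n for n in self.get_all_nodes() if not self.get_parent_nodes(n)]

    def get_leaf_nodes(self):
        return [n for n in self.get_all_nodes() if not self.get_child_nodes(n)]

    def paths_to_root(self, node):
        """Every path from `node` up to a root, as lists of node identifiers."""
        parents = self.get_parent_nodes(node)
        if not parents:
            return [[node]]
        return [[node] + rest
                for parent in parents
                for rest in self.paths_to_root(parent)]


diamond = Graph([
    ("d", "ISA", "b"),
    ("d", "ISA", "c"),
    ("b", "ISA", "a"),
    ("c", "PARTOF", "a"),
])
print(diamond.paths_to_root("d"))
```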
namespace Bio::Ontology
typedef string Identifier
interface VocabTermI extends Bio::Graph::NodeI
attribute Identifier identifier
attribute string label
attribute VocabDefinitionI definition
attribute string[] synonyms
add_synonym(string synonym):
attribute DBLink[] dblinks
add_dblink(DBLink dblink):
timestamp(): TimeStamp
category(): VocabTermI
is_obsolete(): boolean
interface RelationshipI extends Bio::Graph::ArcI
attribute Identifier identifier
attribute VocabTermI parent_term
attribute VocabTermI child_term
attribute VocabTermI relationship_type
interface VocabDefinitionI extends Bio::Root::RootI
attribute string definition
attribute DBLink reference
timestamp(): TimeStamp
interface VocabI extends Bio::Root::RootI
get_term(Identifier identifier): VocabTermI
get_terms_by_label(): VocabTermI[] # note: name/desc not unique
get_all_terms(): VocabTermI[]
get_all_relationships(): RelationshipI[]
get_all_relationship_types(): VocabTermI[]
add_term(VocabTermI term): VocabTermI
add_relationship(RelationshipI relationship): RelationshipI
create_term(Identifier id, string label, string[] synonyms,
            DBLink[] dblinks): VocabTermI
create_relationship(Identifier id, VocabTermI parent, VocabTermI child,
                    VocabTermI relationship_type): RelationshipI
interface StructuredVocabI extends VocabI
! ------------------------------------------------------------------------
# A Graph Vocabulary is a Structured controlled vocabulary with vocabulary
# terms arranged in a graph (or semantic network) structure. parent/child
# relationships in the graph often represent subsumption relationships
# (ie where a more general term subsumes a more specific one) but it is
# not always safe to assume so. The relationships are often transitive,
# but this is not always the case. The graph may be acyclic or may contain
# cycles (in which case recursive traversals must be checked for cycles,
# and there may be no roots or leaves)
#
# The GraphVocabI interface provides different methods depending on
# what semantics are required; some programs may not care about the
# meaning of arcs in the graph (eg graph visualisation tools). Other
# programs may only be interested in subclass/superclass hierarchies, or
# subsumption hierarchies
#
# depending on the implementation, 'covered' and 'covered_by' may mean
# exactly the same as 'child' and 'parent' respectively; if the
# implementation provides some kind of semantics, the meaning may be more
# restricted. for instance, temporal relationship types may not be included
# in the covered/covered_by list, as they do not follow the true path rule.
# relationships that cover/subsume may be : ISA, PARTOF
#
# subclass/superclass relationships are strict inheritance hierarchies
# eg ISA
#
# each implementation of this interface should clearly specify the
# semantics of the different graph traversal calls
interface GraphVocabI extends Bio::Graph::GraphI, StructuredVocabI
get_child_terms(VocabTermI term): VocabTermI[]
get_all_child_terms(VocabTermI term): VocabTermI[]
get_parent_terms(VocabTermI term): VocabTermI[]
get_all_parent_terms(VocabTermI term): VocabTermI[]
get_covered_terms(VocabTermI term): VocabTermI[]        # terms subsumed
get_all_covered_terms(VocabTermI term): VocabTermI[]    # terms subsumed, recursive
get_covered_by_terms(VocabTermI term): VocabTermI[]     # subsuming terms
get_all_covered_by_terms(VocabTermI term): VocabTermI[] # subsuming terms, recursive
get_subclass_terms(VocabTermI term): VocabTermI[]       # inheriting terms
get_all_subclass_terms(VocabTermI term): VocabTermI[]   # inheriting terms, recursive
get_superclass_terms(VocabTermI term): VocabTermI[]     # inherited terms
get_all_superclass_terms(VocabTermI term): VocabTermI[] # inherited terms, recursive
is_acyclic(boolean is_acyclic): boolean
is_rooted(boolean is_rooted): boolean
is_relationship_type_acyclic(VocabTermI relationship_type, boolean is): boolean
is_relationship_type_rooted(VocabTermI relationship_type, boolean is): boolean
is_relationship_type_covering(VocabTermI relationship_type, boolean is): boolean
is_relationship_type_subclass(VocabTermI relationship_type, boolean is): boolean
is_relationship_type_transitive(VocabTermI relationship_type, boolean is): boolean
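To show how the per-relationship-type flags might drive the child/covered/subclass calls, here is a small Python sketch. It assumes the is_relationship_type_* methods are combined getter/setters (pass a value to set, omit it to read) and uses plain strings in place of VocabTermI objects; the example terms are made up.

```python
class GraphVocab:
    """Toy vocabulary where flags on relationship types shape traversal."""

    def __init__(self):
        self._rels = []          # (child, relationship_type, parent)
        self._covering = set()
        self._subclass = set()

    def add_relationship(self, child, rel_type, parent):
        self._rels.append((child, rel_type, parent))

    @staticmethod
    def _flag(flags, rel_type, value):
        # Combined getter/setter: set when a value is passed, always return state.
        if value is not None:
            (flags.add if value else flags.discard)(rel_type)
        return rel_type in flags

    def is_relationship_type_covering(self, rel_type, value=None):
        return self._flag(self._covering, rel_type, value)

    def is_relationship_type_subclass(self, rel_type, value=None):
        return self._flag(self._subclass, rel_type, value)

    def _descend(self, term, allowed=None):
        seen, stack = set(), [term]
        while stack:
            node = stack.pop()
            for child, rel_type, parent in self._rels:
                if parent != node or child in seen:
                    continue
                if allowed is None or rel_type in allowed:
                    seen.add(child)
                    stack.append(child)
        return seen

    def get_all_child_terms(self, term):
        return self._descend(term)                  # every arc type

    def get_all_covered_terms(self, term):
        return self._descend(term, self._covering)  # true-path arcs only

    def get_all_subclass_terms(self, term):
        return self._descend(term, self._subclass)  # strict inheritance only


vocab = GraphVocab()
vocab.add_relationship("neuron", "ISA", "cell")
vocab.add_relationship("motor_neuron", "ISA", "neuron")
vocab.add_relationship("nucleus", "PARTOF", "cell")
vocab.add_relationship("daughter_cell", "DEVELOPS_FROM", "cell")
vocab.is_relationship_type_covering("ISA", True)
vocab.is_relationship_type_covering("PARTOF", True)
vocab.is_relationship_type_subclass("ISA", True)
print(sorted(vocab.get_all_covered_terms("cell")))
```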
# entities associated could be:
# gene products, sequences, seqfeatures, markers (eg phenotypic/trait
# ontologies)
# an association is often to a single term, but sometimes we may want to
# make associations to multiple terms from orthogonal ontologies; e.g.
# geneX is involved in aorta + time_stage25 + growth
interface AssociationI
attribute VocabTermI[] vocab_terms
attribute AnnotatableI[] associated_entities
attribute EvidenceI[] evidence
timestamp(): TimeStamp
# question: should we have a common interface for Marker and GeneProduct
# that implements some standard methods/attributes;
# eg identifier,species, name, label, dblinks,...
# this way, a generic vocab+association tool such as AmiGO could
# be made to work with GO+genes, OR with PO+markers/alleles...
# Refactor Bio::AnnotatableI to be used by Bio::SeqI, and AnnotatableI
# has annotation_collection and convenience methods
interface AnnotatableI
attribute Identifier identifier
attribute string label # e.g. gene symbol
attribute string full_name # e.g. full gene name
attribute string description # e.g. text desc of gene
attribute DBLink[] dblinks
attribute Bio::Map::MarkerI[] markers # any associated markers
attribute SeqI[] seqs
attribute SeqFeatureI[] seq_features
attribute Bio::Species species
attribute string source
interface EvidenceI
attribute VocabTermI[] evidence_types
attribute DBLink[] references # e.g. medline entries
attribute DBLink[] evidence_dblinks # e.g. swissprot accessions
# filter for fetching terms/associations; simple querying system
# examples could be species, source, evidence
interface FilterI
attribute Bio::Species[] species
attribute string[] sources
attribute string[] evidences
interface AssociationStoreI
get_all_annotatables(): AnnotatableI[]
get_annotatables_by_terms(VocabTermI[] vocab_terms): AnnotatableI[]
get_terms_by_annotatables(AnnotatableI[] annotatables): VocabTermI[]
set_filter(FilterI filter):
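A rough Python sketch of an association store with a filter applied to every lookup. It simplifies the model to one term and one entity per association, treats an empty filter field as "accept anything", and uses made-up term identifiers and gene names; none of this is prescribed by the spec.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Association:
    term: str       # vocabulary term identifier
    entity: str     # annotated entity, eg a gene
    species: str
    evidence: str   # evidence code


@dataclass
class Filter:
    species: set = field(default_factory=set)
    evidences: set = field(default_factory=set)

    def accepts(self, assoc):
        # An empty field places no restriction on that attribute.
        return ((not self.species or assoc.species in self.species)
                and (not self.evidences or assoc.evidence in self.evidences))


class AssociationStore:
    def __init__(self, associations):
        self._assocs = list(associations)
        self._filter = Filter()

    def set_filter(self, new_filter):
        self._filter = new_filter

    def _visible(self):
        return [a for a in self._assocs if self._filter.accepts(a)]

    def get_all_annotatables(self):
        return sorted({a.entity for a in self._visible()})

    def get_annotatables_by_terms(self, terms):
        wanted = set(terms)
        return sorted({a.entity for a in self._visible() if a.term in wanted})

    def get_terms_by_annotatables(self, entities):
        wanted = set(entities)
        return sorted({a.term for a in self._visible() if a.entity in wanted})


store = AssociationStore([
    Association("T:transport", "geneA", "fly", "IDA"),
    Association("T:transport", "geneB", "mouse", "IEA"),
    Association("T:binding", "geneA", "fly", "IEA"),
])
store.set_filter(Filter(species={"fly"}))
print(store.get_annotatables_by_terms(["T:transport"]))
```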
interface CombinedVocabI extends GraphVocabI, AssociationStoreI
interface FactoryI extends Bio::Root::RootI
create_Graph(): GraphI
create_GraphVocab(): GraphVocabI
create_VocabTerm(): VocabTermI
!class Factory implements FactoryI extends Bio::Root::Root
!class Bio::Ontology::Triple::Triple implements TripleI
!    extends Bio::Root::Root
!class Bio::Ontology::Triple::TripleStore implements TripleStoreI
!    extends Bio::Root::Root
!class Bio::Ontology::Graph::Graph implements GraphI
!    extends Bio::Ontology::Triple::TripleStore
!class Bio::Ontology::Graph::Arc implements ArcI
!    extends Bio::Ontology::Triple::Triple
!class Bio::Ontology::Graph::Node implements NodeI extends Bio::Root::Root
!class Bio::Ontology::Graph::Path implements PathI extends Bio::Root::Root
!class Bio::Ontology::Graph::GraphIterator implements GraphIteratorI
!    extends Bio::Root::Root
!class Bio::Ontology::Vocab::VocabTerm implements VocabTermI
!    extends Bio::Ontology::Graph::Node
!class Bio::Ontology::Vocab::Relationship implements RelationshipI
!    extends Bio::Ontology::Graph::Arc
!class Bio::Ontology::Vocab::GraphVocab implements GraphVocabI
!    extends Bio::Ontology::Graph::Graph
!class Bio::Ontology::Association::Association implements AssociationI
!    extends Bio::Root::Root
!class Bio::Ontology::Association::Evidence implements EvidenceI
!    extends Bio::Root::Root
!class Bio::Ontology::Association::Annotatable implements AnnotatableI
!    extends Bio::Root::Root
namespace Bio::Ontology::KB
! this is meant to illustrate how the above system of interfaces could
! be extended into a full frame-style knowledge base / ontology; don't
! expect any implementations of these interfaces for a while...
! should we aim for OKBC compliance here? (see http://?)
! should this be namespaced within bioperl?
! these interfaces could be implemented in different ways; one way
! would be to layer it directly on top of the existing graph layer,
! and implement the DAML+OIL axioms (hard to have full DAML+OIL compliance
! implemented in an imperative language) - or it could just form a bridge
! with an existing ontology tool/KB
! really this is just a demonstration of how this *might* be done - this
! is getting into a gnarly area... eg do we take a DL approach or a more
! expressive approach blah blah....
interface ClassI extends VocabTermI # concept aka Class aka Frame
all_classes(): ClassI[]
sub_classes(): ClassI[]
all_sub_classes(): ClassI[]
super_classes(): ClassI[]
all_super_classes(): ClassI[]
slots(SlotI []): SlotI
interface SlotI extends VocabTermI