[MOBY-l] MOBY at NCGR/CSHL- intro to ISYS and its conceptual relationship to MOBY

Thu Oct 3 23:12:17 UTC 2002

Hi Mark-

> I picked the one thing that I could respond to quickly - I'll have to
> chew on the rest of it for a while...

Wise choice, I think. I've been hoping to carry on the conversation in
more manageable pieces... Even this "simple" point will probably end
up generating a lot of discussion!

>
> >I have deliberately singled out the
> >"id" here, since you've put it at the root of the whole moby_class hierarchy.
> >While there may be other reasons for insisting on an id for all objects
> >in the system, in terms of its use as a "lookup", it really seems only relevant
> >for services that want to do a retrieval of information using that key.
> >
> >
> >
> I have to disagree, but only because we are putting different emphases
> on the idea of an "id".  The root of the class hierarchy is actually the
> Triple <Instance/Namespace/ID>.  In services which require only
> sequences (e.g. a service which calculates GC content or something like
> that) you are right in saying that the service doesn't need an ID.
>  However, we *do* need an ID if we are submitting a large number of
> sequences to that service (in a single transaction) and receiving a large
> number of results in response.  The ID is the way to match the input
> with its output... especially since we do not guarantee that the service
> will provide a result for every input that it is given.
>
> Namespaces and ID's can be entirely arbitrary, and in these cases can be
> used by the Client for local "bookkeeping" , or they can be meaningful,
> such as when you have a retrieve service which requires a genbankGI
> number as input.  This is why the namespace parameter is optional during
> service registration - if the service doesn't care, then it just leaves
> that parameter out and is thus "discovered"  by anyone who has the
> correct object *type* in their hands, regardless of that object's namespace.

The main point I was trying to make in my earlier message
was not really about whether the concept of identification should be
made central to the data representation. It was really about how
one should be careful to identify "orthogonal concepts" in your
data modelling, and to avoid tangling them up together in your
ontology. I should probably have steered clear of the triple, since I
figured something like what you are describing was at the root of
it. So, just to make sure the main point isn't missed (before I
turn my attention back to the triple!), let me give a different
example of what I mean. I'm going to take an imperfect example from the
moby_classes file. I may well be misinterpreting the intention of the
representation there, so don't focus too much on that; let's just see
whether we agree on the basic principle.

Polymorphism    [type {snp, indel, deletion, insertion, reversion} start, end, DNASequence]

GeneModel [DNASequence, ProteinSequence, orientation, Chromosome, start, end, SequenceFeatureAnnotation+]

Annotation [evidence]
    SequenceFeatureAnnotation [start, end, SequenceFeatureOntologyId]

First of all, the "start/end" concept appears at multiple points in
this representation. I don't think it really means anything different
in these contexts from the point of view of a consumer like a map/sequence
viewer; it's just a representation of the position (of something) in a
coordinate system (though maybe the identity of the coordinate system is
being handled differently in the different places, e.g. by reference to
DNASequence for Polymorphism and to Chromosome for GeneModel?). If we
"factor out" this "data unit", we have the notion of something located,
independent of whether it has a "type" or "evidence" or a "sequence", etc.
So, if I am a new provider of data coming into MOBY, I may have positional
information of some sort, and that's good enough to make a meaningful
statement about my data, without me needing to worry about whether it
maps better onto the concept of a GeneModel or a SequenceFeatureAnnotation.
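
To make the "factoring out" concrete, here is a rough sketch of what a
standalone position data unit might look like. The names (Location,
coordinate_system, overlaps) are my own inventions for illustration and
are not taken from the moby_classes file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical 'position' data unit: start and end travel together."""
    coordinate_system: str  # assumed: names a DNASequence, Chromosome, etc.
    start: int
    end: int

    def overlaps(self, other: "Location") -> bool:
        # Only positions in the same coordinate system can be compared.
        return (self.coordinate_system == other.coordinate_system
                and self.start <= other.end
                and other.start <= self.end)


# The same unit could be attached to a Polymorphism, a GeneModel or
# something with no higher-level type at all (values are made up).
polymorphism_pos = Location("seq1", 100, 100)
gene_model_pos = Location("seq1", 50, 400)
elsewhere = Location("chr2", 50, 400)
```

A viewer would only need to understand Location, whatever else happens to
be attached to the thing being located.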

Now, let's look at the "evidence" property of Annotation and the
"inheritance" relationship between Annotation and SequenceFeatureAnnotation.
Is it the case that every Annotation must have an "evidence", and
therefore every SequenceFeatureAnnotation must also? Would an "evidence"
associated with a Polymorphism be something different from an "evidence"
associated with an Annotation, or does that mean that a Polymorphism with
evidence would be an Annotation? Why don't we just say that there is some
notion of "evidence"
that may be associated with things, without implying anything about their
"TYPE"? Furthermore, if I had data with a "start/end" and a
"SequenceFeatureOntologyId" (not really sure what that is), but no "evidence",
what does that make it, "ontologically"? Does anyone really care, or should
we simply look at the orthogonal, primitive "data units" such as "position"
and "evidence", and let the decisions about what it "means" to have a
certain combination of these be up to the consumer?

To my way of thinking, "evidence" is one concept, "location" (start/end)
is an orthogonal concept. A "consumer" such as a map viewer does not need
to understand "evidence" in order to handle "location"; similarly, a
component that "handles evidence" doesn't need to understand "location".
I don't think that's very controversial. The controversy seems to
arise when you take away the "higher level types" that represent someone's
idea of "meaningful object classes" in terms of certain combinations of
these pieces of information. But it's these "higher level types" that
seem to be so difficult to get people to agree on!
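
As a sketch of how this might work in practice (all names here are
hypothetical, and the sample values are made up), each piece of data is
just a collection of data units, and each consumer picks out the units it
understands while ignoring the rest:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    start: int
    end: int


@dataclass(frozen=True)
class Evidence:
    description: str


def find(units, unit_type):
    """Return the first data unit of the requested type, or None."""
    for unit in units:
        if isinstance(unit, unit_type):
            return unit
    return None


def map_viewer(units):
    """Consumer that understands only Location."""
    loc = find(units, Location)
    return None if loc is None else (loc.start, loc.end)


def evidence_handler(units):
    """Consumer that understands only Evidence."""
    ev = find(units, Evidence)
    return None if ev is None else ev.description


# Neither item declares a higher-level type such as GeneModel.
located_only = [Location(10, 20)]
located_with_evidence = [Location(30, 45), Evidence("predicted by similarity")]
```

Note that nothing here says what a Location plus an Evidence "is"; that
decision is left to whoever consumes the combination.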

One thing I meant to point out in my earlier presentation of this
approach is that these "data units" need not be atomic in structure,
they can be as highly-structured as is necessary to express a simple
and coherent concept. For example, one need not assert that
"starts" and "ends" are individually meaningful data units, but
instead insist on a data unit that includes both together. Of course,
there's some amount of judgment that goes on here, but
it seems to be a useful rule of thumb...

Now, back to the triple!

I agree with you that there may be good reasons for requiring a MOBY
object to have a triple, but I don't want to take it for granted
(I also want to make sure we are understanding things in the same way).

It sounds like we agree that there is a difference between
identification for purposes of bookkeeping (mapping each input object
to its output) and identification for purposes of specifying
a data retrieval.

So, one question is whether a single MOBY construct should be doing
"double-duty", or whether it would be better to cleanly separate
them. For example, suppose that two data providers can do a retrieval
with respect to the same namespace/id (e.g. EBI and NCBI with
respect to a specific IC Accession). Though the namespace/id in
this case identifies a unique output for each provider, the
responses may be different, and when the two outputs are
combined into a single data set, the namespace/id no longer
serves the function of a unique identification in the
combined context.
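
Here is a toy illustration of that collision, with made-up provider
names, namespace, id and payloads. It is only a sketch of the concern,
not a description of how MOBY actually combines results.

```python
def combine_by_key(responses_by_provider):
    """Merge per-provider results keyed only by (namespace, id)."""
    combined = {}
    collisions = []
    for provider, records in responses_by_provider.items():
        for key, payload in records.items():
            if key in combined and combined[key] != payload:
                collisions.append(key)  # same key, different content
            combined[key] = payload     # the later provider silently wins
    return combined, collisions


def combine_by_provider_and_key(responses_by_provider):
    """Merge the same results, keeping the provider as part of the key."""
    return {
        (provider, namespace, ident): payload
        for provider, records in responses_by_provider.items()
        for (namespace, ident), payload in records.items()
    }


# Hypothetical responses from two providers for the same namespace/id.
responses = {
    "provider_a": {("some_namespace", "X1"): "record as served by provider A"},
    "provider_b": {("some_namespace", "X1"): "record as served by provider B"},
}
combined, collisions = combine_by_key(responses)
qualified = combine_by_provider_and_key(responses)
```

With the namespace/id alone as the key, one response overwrites the
other; something extra (here, the provider) is needed to keep both.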

Also, I wonder if the "bookkeeping" sort of identification is
essential to MOBY interactions, or whether it is another one of those
things that may be useful in certain sorts of exchanges, but not
in others? For example, suppose that I have a set of genes and
want to know the most specific GO term that subsumes them all
under the bioprocess ontology. In this case, it's not a single
output per single input situation. Even if every input has an id
and the output has an id (because it's a GO term; it wouldn't
naturally have one if we were talking about an "analytical product"
like a multiple sequence alignment or a phylogenetic tree), it's
implicit in the nature of the service that the single output depends
on the whole input set. It seems as though the idea is to
be able to cope with "batch requests" to services that
handle each "object" in the input set individually and
produce some output set for each input object. It doesn't
seem as though this paradigm is universal, and I guess I'm
just wondering if "requiring" that a "first-class MOBY object"
have such a thing is too restrictive, or whether we shouldn't
treat "unique identification" as another orthogonal concept
to be used in interactions where it's appropriate?
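
To contrast the two kinds of service, here is a small sketch. The first
function is a per-input batch service (GC content), where ids are what
let the client match results to inputs when some results are missing.
The second depends on the whole input set, so per-input ids play no part
in interpreting the output. The tiny term hierarchy is invented for the
example and is not real GO data; all function names are hypothetical.

```python
def gc_content_batch(sequences_by_id):
    """Per-input service: one result per id, but possibly not for every id."""
    results = {}
    for seq_id, seq in sequences_by_id.items():
        if not seq:
            continue  # no result is returned for this input
        gc = sum(1 for base in seq.upper() if base in "GC")
        results[seq_id] = gc / len(seq)
    return results


# Made-up hierarchy: child term -> parent term.
PARENT = {"a1": "a", "a2": "a", "a": "root", "b": "root", "root": None}


def ancestors(term):
    """The term followed by its ancestors, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def most_specific_common_term(terms):
    """Whole-set service: a single output that depends on all inputs."""
    common = None
    for term in terms:
        path = ancestors(term)
        common = path if common is None else [t for t in common if t in path]
    return common[0] if common else None
```

For example, gc_content_batch({"s1": "GGCC", "s2": "", "s3": "ATGC"})
returns results for s1 and s3 only, and without the ids there would be
no way to tell which input was dropped.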

I should probably figure out how to express my vague misgivings
better before I take up any more bandwidth... at any rate,
this gives you some idea of how the "ultra-agnostic" approach
works with respect to defining data models...

>
> I need to send the MOBY manuscript to the group - I'll ask BIB if they
> mind me distributing it prior to publication.  In any case, I'll send it
> to you first thing tomorrow (remind me).

Consider yourself reminded. I'm very interested...

>
> Cheers!
>
> m

Andrew Farmer
adf at ncgr.org
(505) 995-4464
Database Administrator/Software Developer
National Center for Genome Resources