[MOBY-l] MOBY at NCGR/CSHL- intro to ISYS and its conceptual relationship to MOBY

Andrew D. Farmer adf at ncgr.org
Wed Oct 2 21:27:38 UTC 2002


Hi Mark-
Sorry it took me so long to respond to this. A couple of abortive early
attempts at a response at least made me feel like I was beginning to
understand you better. I've been terribly long-winded in places; please
forgive me.

> sure I understand everything, please read all of my statements with an
> inflection at the end ;-)

Granted, as long as you promise to take all my statements about the current
incarnation of MOBY with the same grain of salt!


> >When ISYS starts, it "discovers" ServiceProviders through
> >a simple plug-in strategy; it simply scans through a Components directory in
> >which each component provides a basic structure for providing its resources,
> >
> This is the ISYS analogue of MOBY Central, ja?  As I understand it, the
> key differences are that there is no registry "thingy" per se, just a
> directory.  Service Providers are transiently "registered" as they are
> started/stopped through their individual GUI's (or whatever), and
> "registration" involves putting interface descriptions into that
> directory.  Is that right?

Pretty much. The ISYS startup machinery automatically "registers" all the
ServiceProviders it finds via its directory search (there's not really
any GUI mechanism involved here, unless you're thinking about the
place where you can activate/deactivate ServiceProviders...)

In terms of analogy to MOBY-Central (as I understand its current
conceptualization), I would present the following picture:

MOBY-C function             ISYS
---------------             ----
Datatype ontology           IsysAttribute interface hierarchy

Service ontology            Service interface hierarchy (well, given the
                            discussion below, it's not really
                            such a good analogy!)

Service registration        "Static" services registered through
                            ISYS queries to ServiceProviders at
                            startup; "dynamic" services never really
                            "registered", as the request matching
                            isn't done by the platform...

Service request matching    "Static" service requests matched to
                            service implementation by ISYS platform;
                            "dynamic" services matched to requests by
                            the ServiceProviders themselves.
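
As a rough illustration of the "static" rows in the table, here is a minimal Java sketch of a platform-side registry that hands out implementations of a declared service interface. `RetrieveSequence` is the example service name used later in this thread; its `retrieve` signature, the class `StaticServiceRegistry`, and the methods `register`, `allImplementations`, and `defaultImplementation` are assumptions made for this example, not the actual ISYS API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A "static" service type: a fixed, well-structured interface (signature is an assumption). */
interface RetrieveSequence {
    String retrieve(String identifier);
}

/** Hypothetical platform-side registry: the platform, not the provider, does the matching. */
final class StaticServiceRegistry {
    private final Map<Class<?>, List<Object>> implementations = new HashMap<>();

    /** Called at startup for each implementation a ServiceProvider reports. */
    <T> void register(Class<T> serviceType, T implementation) {
        implementations.computeIfAbsent(serviceType, k -> new ArrayList<>()).add(implementation);
    }

    /** All known implementations of the requested service interface. */
    <T> List<T> allImplementations(Class<T> serviceType) {
        List<T> result = new ArrayList<>();
        for (Object impl : implementations.getOrDefault(serviceType, Collections.emptyList())) {
            result.add(serviceType.cast(impl));
        }
        return result;
    }

    /** Stand-in for a user-specifiable default: here, simply the first one registered. */
    <T> T defaultImplementation(Class<T> serviceType) {
        List<T> all = allImplementations(serviceType);
        if (all.isEmpty()) {
            throw new IllegalStateException("No implementation of " + serviceType.getName());
        }
        return all.get(0);
    }
}
```

A requesting component written against `RetrieveSequence` would ask the registry for either the default or the full list, and never needs to know which component supplied the implementation.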


>
> > gets the information about which classes represent ServiceProviders,
> >
> as opposed to...??  What else is in that directory?  Which "components"
> represent ServiceProviders,  or which "classes"?  (I'm just trying to
> clarify the terminology)  As I understand it, Components implement
> Classes, and one of the Classes that can be implemented is
> ServiceProvider... or am I completely up the creek?

I'm not sure how much relevance this has to MOBY, but since
you asked...

Typically, the directory contains everything that the component needs to
have locally in order to be used (although some components have been implemented
to assume that the application is already installed independently of ISYS,
and then it just contains the ISYS wrapping bits).

In my typical usage of the terms, a "Component" is the "unit of pluggability"
for ISYS, i.e. either the Component is present in an ISYS installation,
or it isn't- you can't put in half a Component, even if that single bundle
provides multiple functionalities that are logically separable.

Every "Component" will have (at least) one class (sensu Java) that implements
the ServiceProvider interface (sensu Java). If it doesn't have a
ServiceProvider, then there is no way for ISYS to know about it.
Even GUI components will have a class that acts as the ServiceProvider,
providing one or more types of services that will ultimately create
instances of the GUI.

I guess there may be a little Java/Perl confusion here. I'm not sure that Perl
has any notion of "interface" (it doesn't make all that much sense for an
interpreted/loosely-typed language)? In Java, an interface is basically a
set of method declarations with no associated implementation of those
methods. A class can "implement" one or more interfaces by providing
implementations for those methods. It's like subclassing, except it allows
you to have DAGs in your inheritance hierarchy by allowing multiple
interface implementation (whereas you can only subclass a single class
and inherit overridable implementations of its methods). Note that a lot
of my discussions of the ISYS data modeling approach refer to this notion
of multiple interface inheritance, so if there is confusion here, please
let me know...
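
To make the Java terminology concrete, here is a minimal sketch of multiple interface implementation. `IsysAttribute`, `LinearObject`, and `SequenceText` are names used in this thread; the method names, the `SpeciesName` interface, and the `SpeciesTaggedSequence` class are hypothetical and only serve the illustration.

```java
/** Root of the attribute hierarchy (a marker interface in this sketch). */
interface IsysAttribute {}

/** Abstraction of a thing that has a length. */
interface LinearObject extends IsysAttribute {
    int getLength();
}

/** Sequence text is a LinearObject that also exposes its residues. */
interface SequenceText extends LinearObject {
    String getText();
}

/** A second, unrelated attribute (hypothetical name). */
interface SpeciesName extends IsysAttribute {
    String getSpeciesName();
}

/**
 * One class, two interface parents. IsysAttribute is reachable by two
 * paths, so the type hierarchy is a DAG rather than a tree.
 */
final class SpeciesTaggedSequence implements SequenceText, SpeciesName {
    private final String text;
    private final String species;

    SpeciesTaggedSequence(String text, String species) {
        this.text = text;
        this.species = species;
    }

    @Override public String getText() { return text; }
    @Override public int getLength() { return text.length(); }
    @Override public String getSpeciesName() { return species; }
}
```

Code that only understands `LinearObject` can use such an object for its length, while code that understands `SpeciesName` can use the same object for its species, without either knowing about the other interface.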


>
> >Services in ISYS:
> >
> >Services and Service brokering in ISYS come in two distinct flavors:
> >"static" and "dynamic".
> >
> >The former is the more traditional approach by which specific well-structured
> >interfaces are defined (e.g.  a "RetrieveSequence" service with specified
> >input types (Identifier) and output types (SequenceText));
> >
> so currently you would define MOBY services as being "static"... and
> yet... not quite.  This scenario seems to differ from MOBY in one
> critical way - that our interfaces are "modularly" defined, rather than
> being well-structured:  A "RetrieveSequence" service type would not
> exist in MOBY.  Rather we would have a "Retrieve" service that had an
> output type of "Sequence", and an ~arbitrary input type, depending on
> what the service provider needed. When a service is required, its
> interface is dynamically discovered (WSDL), but we avoid the wild and
> woolly world of data types by defining the data formats and requiring
> that one or more of these be passed... so a Client (or component)
> doesn't have to be specifically designed for that interface.

OK, now it's my turn to ask questions! It sounds from what you say here
and below that I may have misinterpreted the MOBY "service ontology" a bit.
Would it be correct to say that the service ontology as you conceive it
makes no constraint with respect to "structure", but merely supplies certain
"semantics" for the services associated with it? That "structural constraints"
(i.e. input/output declarations) are a function of the service description
(WSDL), but that this has no necessary relationship to the "ontology"?
Or are you saying that the "Retrieve" service in the ontology is without
constraint, but that there would be a child of this that would be a
"class of all services that retrieve Sequences" which would constrain its
children to return Sequences (but would not constrain the input type)??
I guess we shouldn't dwell too much on the details at this point, but
I do think it's important to make a distinction between structure and
semantics, and understand how these concepts relate to "ontologies".

I'm not very clear on some of the details of how you are thinking
about the service descriptions. It sounds as though you're basically
thinking about these as "input type"/"output type" characterizations, and
that this will be used by MOBY-Central as one basis for service request
matching?

Would you allow a service to have multiple orthogonal inputs, in the sense
of the "arguments" to a function call, or is your notion of input more or less
a monolithic "input data set", e.g. would a pairwise alignment service describe
itself as having two inputs with type sequence, or would it just ask for a
"data set containing sequences"? It seems as though the implication of the
latter is that the service requester doesn't have to understand very much
about the service interfaces, and on the other hand, is not allowed to
exercise much intelligence in terms of service invocation.

As you suggest, it does seem very much as if your current conceptualization
of MOBY services is a little bit between the static/dynamic divide of ISYS.
On the one hand, services are required to give a little bit of signature-like
info (I take X, and give back Y), for the purposes of service request matching;
whereas they don't describe themselves very richly in terms of signatures,
meaning that most of the work of interpreting the request with
respect to the input rests with the service, rather than with the requester.
So that, for example, a client that got back a service description whose input
was "sequences" would at least know that it might as well not send any data
that didn't include sequences, but it wouldn't have any control over how
those things were interpreted in the service request (e.g. for a service that
might treat input sequences asymmetrically).


>
> >in this case, a component that wants to use that specific functionality
> >will have been designed with knowledge of the service interface, and will
> >simply ask ISYS to provide it with an implementation
> >
> This is also not strictly true in MOBY.  For us, the "component" ( which
> I think is a "Client" in MOBY speak) does not need to be designed with
> knowledge of the service interface per se.  The interface -
> input/output/URL - is dynamically discovered using WSDL, and in fact can
> be completely ambivalent about what type of service it is dealing
> with... i.e. a client doesn't have to know what to *do* with a Sequence
> in order to happily retrieve it and pass it on to the next service... so
> service implementation is not a problem.  *Representation* of data, on
> the other hand, will have to be coded for each data type (or at least,
> each of the basic parent data types)

Again, this somewhat depends on the "constraints" placed upon the
interface description. If you think about every moby service as being
describable in terms of single input type/single output type, I guess
this seems fair. It's not totally clear to me how you're imagining the
type declaration. For example, if I am a service like Entrez, how do
I "type" my output if it corresponds to many different pieces of the
moby data ontology, e.g. sequences, keywords, species info, annotations?

These questions are related to the whole issue of
"strongly-typed/highly-structured" vs "loosely-typed/semi-structured" that
I'll belabor at the end of this missive...


>
>
> > (either a user
> >specifiable default, or a list of all known implementations) of that
> >service. These service types are specified as interfaces (along with
> >specifications of datatype interfaces for their input and output types)
> >in a special package that acts as a kind of catalog of static
> >services; in this way, it is somewhat analogous to a certain way of looking
> >at the notion of a service "ontology" at a central MOBY registry
> >
> this is actually much closer to the myGrid service ontology definition,
> than it is to the current MOBY view of a service ontology, as I said
> above...  I see the power of both approaches, and I still haven't come
> to a conclusion about which is better.  I think, that our approach is
> more flexible w.r.t. client/service design, but we suffer badly from
> having little or no machine-readable logic.  We rely on a human to look
> at the input and output types, the name/description of the service
> type, and decide for themselves what exact transformation is happening in
> between.  e.g.  if you give me a PubMed ID, and I offer to give you back
> Sequence objects, and the service type is "Retrieve"... what does that
> mean??  It could mean "Give me the sequences that were published in this
> manuscript", or it could mean "give me all of the sequences that were
> published by the author of this manuscript".  In this sense, the current
> MOBY situation is a total nightmare and needs some serious tightening
> up!!!!  We can't make the semantic bioweb without some additional
> ontology(ies) sitting around somewhere to more clearly define our
> services, rather than the current human-readable descriptions.  At the
> same time, access to these orthogonal relationships between data
> (PubMed/ID --> Sequence) are, in my mind, one of the most wonderful
> aspects of the MOBY system and I am loath to lose them altogether...

My impression of the myGrid approach was slightly different (though I've
never spoken to them, just read a paper on their site). It sounded to
me as though they were arguing for an approach whereby services were
described according to certain well-defined properties, and that
hierarchical "ontologies" could be inferred from these descriptions. So,
the particular hierarchy we presented in ISYS for static services might
be just one of many possible ways of deriving a classification system
from the service descriptions. I really like that notion, and think it's
well worth exploring in MOBY...


>
> >We haven't actually used this mechanism of service specification
> >that heavily in ISYS, having found the "dynamic service" paradigm rather
> >more powerful in the context of the ISYS client-orientation.
> >
> okay, here we go :-)   Let's figure out if this additional power can
> work in the MOBY System - it sounds like we are already sitting
> somewhere between your "static" and "dynamic" models in any case...

Exactly my reason for bringing it up...

>
> >components may provide implementations. There is no conceptual reason
> >why it couldn't be a deeper hierarchy, we just never found it to be useful
> >to abstract things out to higher levels.
> >
> To be honest, we haven't really thought clearly about a service
> "hierarchy".  We started to define basic service types ("Retrieve",
> "Blast", "Alignment"), and they form a loose hierarchy ("Blast" might be
> a child of "Alignment"), but there is no way to sensibly name a service
> like the PubMed->Sequence that I suggest above except to call it a
> "Retrieve"... which isn't very useful.  So, I think we have found the
> same problem as you have...  as soon as you try to go beyond basic
> service types the numbers become overwhelming, in particular when we
> think of all of the orthogonal slices we could make through the data...

Exactly. I think a more reasonable approach is bottom-up, whereby you
start with the "empirical facts" of the actual service implementations as
they stand, try to establish some reasonable language for describing them
using a common vocabulary, and then let hierarchical organization schemes
evolve from there. Or something like that...

BTW, I think many of the same issues are relevant with respect to definition of
data types...



>
> >Dynamic services differ from static services in several important ways.
> >First, the interface for dynamic services is very generic, and totally
> >encapsulates the "semantics" of the service. There is a distinction between
> >"DynamicDataService" and "DynamicViewerService"- the former returns data,
> >the latter provides a visualization (i.e. a "Client")
> >
> A MOBY client would, presumably, fit as one of your DynamicDataServices?

I wouldn't have thought so, but perhaps you can elaborate? Did you mean
MOBY services?

>
> >, but other than that,
> >dynamic services are almost totally opaque to the system in terms of
> >what exactly they are doing (although they are required to provide a
> >descriptive String so that the user gets a sense for what it is he/she is
> >invoking).
> >
> so we have the same problem :-)
>
> >Second, the inputs and outputs of a dynamic service are similarly
> >opaque to the system;
> >
> like MOBY.

I'm not sure if this is true of MOBY or not. If the "input type"/"output type"
"signature" of services is being used to do service request matching, I would
claim it's not true, i.e. MOBY is constraining services to explicitly
describe themselves in this way so that it can implement the service
matching itself. The ISYS dynamic approach, on the other hand, basically only
constrains a dynamic data service to accept data and return data, and the
matching of services with data is encapsulated in the service provider.


>
> >When this occurs, ISYS simply passes around
> >references to the data set to each of the registered ServiceProviders, and
> >asks them to inspect the data and return the set of dynamic services that they
> >provide that could be used on the dataset. This inspection can be as simple
> >as looking for data of a certain type (e.g. identifiers in a certain namespace)
> >or more complicated (e.g. looking at the lengths of the sequences provided, or
> >the value of a species attribute). The main point is that the "service matching"
> >is totally encapsulated in the ServiceProvider, and does not depend on some
> >third party "matchmaker" like UDDI.
> >
> This is interesting, and very different from MOBY.  It's much more "P2P"
> than we are, and it does give you some abilities that we don't have.
>  e.g. we pass around only the name of the object when looking for
> services, so service providers can't "inspect" the object until they
> have already been selected.  This can lead to hiccups such as the one
> that Lincoln raised at my BOSC presentation where a service may say it
> can use an object, receive the object as input to a service transaction,
> and then discover that it can't really use it at all...  This isn't a
> *critical* problem, but it is, as I say, a hiccup that a Client needs to
> be aware of.  In your system, presumably, this cannot happen.

It can happen! It's really up to the service provider to decide how
"fussy" they want to be about inspecting the data before they decide to
present services for it. For example, suppose you had a sequence analysis
service that would only operate on sequences in a certain range of lengths;
at "service discovery" time, you might merely check the given data set
to see if it had sequences, without bothering to test the lower-level
conditions.

The fundamental difference I see is that the "service matching" is not
done at the "central authority"; instead it's encapsulated by the
providers of the services. Thus, there is no need for a service description
language to be expressive enough to encode the service matching algorithms
for use by a third-party service matcher; also, since there is no
"exposure" of the details of the service matching, there are no dependencies
that must be considered when the details of the matching change due to
changes in the implementation of the service itself.
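
The provider-side matching described above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the `DynamicServiceProvider` interface, its `servicesFor` method, the `Sequence` class, and the length limits are all hypothetical and do not reproduce the real ISYS interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Minimal stand-in for a sequence in the shared data set. */
final class Sequence {
    final String residues;
    Sequence(String residues) { this.residues = residues; }
}

/** Hypothetical provider hook: inspect the data, describe the applicable services. */
interface DynamicServiceProvider {
    List<String> servicesFor(List<Object> dataSet);
}

/** A provider whose analysis only accepts sequences within a length range. */
final class LengthLimitedAnalysis implements DynamicServiceProvider {
    static final int MIN_LENGTH = 20;    // assumed limits, for illustration only
    static final int MAX_LENGTH = 1000;

    /** Discovery time: a deliberately cheap check (are there any sequences at all?). */
    @Override
    public List<String> servicesFor(List<Object> dataSet) {
        for (Object item : dataSet) {
            if (item instanceof Sequence) {
                return Collections.singletonList("Analyse sequences (20-1000 residues)");
            }
        }
        return Collections.emptyList();
    }

    /** Invocation time: the stricter conditions are tested only here, so a request can still fail. */
    int analyse(List<Object> dataSet) {
        int analysed = 0;
        for (Object item : dataSet) {
            if (!(item instanceof Sequence)) continue;
            int length = ((Sequence) item).residues.length();
            if (length < MIN_LENGTH || length > MAX_LENGTH) {
                throw new IllegalArgumentException("Sequence length out of range: " + length);
            }
            analysed++;
        }
        return analysed;
    }
}

/** The platform only relays the data set; it holds no matching logic of its own. */
final class DiscoveryRelay {
    static List<String> discover(List<DynamicServiceProvider> providers, List<Object> dataSet) {
        List<String> offered = new ArrayList<>();
        for (DynamicServiceProvider provider : providers) {
            offered.addAll(provider.servicesFor(dataSet));
        }
        return offered;
    }
}
```

Because the length test lives entirely inside the provider, it can be tightened, loosened, or moved into `servicesFor` without any change to the relay or to a central description of the service.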

I'm not sure that it's truly "P2P", as the relationship between service
requester and service provider is still pretty asymmetrical, and the
service requester doesn't present any interface for the service provider
to use. A more "P2P" approach might be to have the service requester roughly
describe its data, but allow the service provider to ask it further questions
about the data content when necessary...


>
> > Of course, the fact that we're only passing
> >around references to objects in memory makes this much easier than doing a
> >similar trick on the network, but one can imagine an analogous mechanism for
> >a MOBY-like system; for example, if a MOBY client simply sent out some simple
> >representation of what data types were present in its input set, that would
> >probably be sufficient for most providers to do a reasonable job of presenting
> >their relevant services.
> >
> indeed...  we'd have to pass around the base MOBY-Triple at least
> (instance/namespace/id).  The service provider would then know what type
> of data would be contained in the object (from the instance), what
> namespace it falls into, and moreover, what ID it has.  This latter
> point is the one that we are currently missing - the fact that it would
> be ridiculous for a service provider to register each ID number that it
> knows about in the Registry to ensure that it is never passed something
> it can't deal with.  Currently, service providers register only object
> type (instance) and namespace (optional), and if they get something sent
> to them that matches those criteria, then buyer beware!

Oh, so namespaces are actually registered in the ontology? That wasn't
really clear from the documentation. That's similar
to the approach we took in ISYS, i.e. we create static subtypes of Identifier
based on namespace (e.g. IcAccession, GiNumber, etc.). It has some
advantages and some disadvantages along the classic static/dynamic "typing"
divide, but I think the advantages mostly prevail in this case...
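
The two ways of handling namespaces can be contrasted in a short Java sketch. `Identifier` and `GiNumber` are names from the ISYS discussion; the `getValue` method, the classes `SimpleGiNumber`, `NamespacedIdentifier`, and `GiMatcher`, and the namespace string `"gi"` are assumptions introduced for this example.

```java
/** Common parent for all identifiers (method name is an assumption). */
interface Identifier {
    String getValue();
}

/** Namespace encoded in the type: one static subtype per namespace. */
interface GiNumber extends Identifier {}

final class SimpleGiNumber implements GiNumber {
    private final String value;
    SimpleGiNumber(String value) { this.value = value; }
    @Override public String getValue() { return value; }
}

/** Namespace carried as a value: one generic class covers every namespace. */
final class NamespacedIdentifier implements Identifier {
    private final String namespace;
    private final String value;

    NamespacedIdentifier(String namespace, String value) {
        this.namespace = namespace;
        this.value = value;
    }

    String getNamespace() { return namespace; }
    @Override public String getValue() { return value; }
}

final class GiMatcher {
    /** Static typing: the namespace is visible to the compiler and to instanceof. */
    static boolean matchesByType(Identifier id) {
        return id instanceof GiNumber;
    }

    /** Dynamic typing: the namespace is only visible by inspecting the value. */
    static boolean matchesByValue(Identifier id) {
        return id instanceof NamespacedIdentifier
                && "gi".equals(((NamespacedIdentifier) id).getNamespace());
    }
}
```

The static form lets a registry match on type alone, at the cost of a new subtype for every namespace; the value form needs no new types but requires the matcher to look inside the data.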

>
> I wonder, though, if the overhead of passing larger objects, multiple
> times from P2P, and having the service inspect these each time, is worth
> the pain for the gain?  I guess it isn't so much larger a message than
> other P2P broadcasts (so long as we broadcast only the triple, and not
> the payload), but it's still more network traffic than we currently have.

There is certainly a cost/benefit threshold, but I think we need to explore
various alternatives. I certainly don't advocate passing around the entire
data set over the network for the purposes of service request matching, but
I'm not sure that a simple type oriented approach to describing content
is flexible enough for this field. More on this in the "data modelling"
discussion at the end...


>
> >(Note that there may be some fuzzy ground here between
> >the notions of a "type" and a "value"; for example, if one uses the LSID
> >structure (as I understand it), the "namespace" is a property of the "value"
> >instead of the "type" of that data, but would probably be critical in matching
> >retrieval services to the data;
> >
> in MOBY we consider namespace a data type, rather than a value.
>
> >another example would be a "sequence", for
> >which the "alphabet" used by the sequence could be encoded into a subtype or
> >simply viewed as a property of the sequence "value".)
> >
> this is exactly the level of complexity that I was hoping to avoid at
> the registry (discovery) level.
>
> >Though I certainly agree that
> >having services that are more self-descriptive will be valuable, I don't think
> >we should rule out the possibility of exploring alternative approaches to
> >the service brokering. For example, one could imagine "MOBY Central" as
> >being nothing more than a registry of distributed "ServiceProviders"
> >(and probably the registry of the "ontologies/vocabularies" of data types
> >and service types or service descriptors).
> >
> So in this way it breaks the P2P paradigm in that you *must* connect to
> MOBY Central first in order to discover service providers, rather than
> discovering service providers through broadcasting over the P2P network?
>
> ... if so... what have we gained (other than the ability of the service
> provider to inspect the data... which is in itself significant!)

I guess I don't see the details of how references to ServiceProviders are
obtained as being too essential to the argument. A client could
obtain its list of ServiceProviders in different ways (registry lookup,
P2P broadcast, bookmarked list, etc.) without affecting the
nature of the way in which it gets service "advertisements" from those
providers.  As I described above, it seems to me that the significant
difference of this approach lies in removing the service matching from the
central authority. It's worth thinking about...


>
> >(Note that I'm deliberately trying to paint this picture without
> >using SOAP/WSDL for the time being, although one could imagine using those
> >as well....)
> >
> sure
>
> >We can explore the pros and cons of the various approaches in subsequent
> >discussions, but I hope this helps get people thinking in different
> >ways about the problem.
> >
> let's start exploring them now :-)
>
> Although I see the power gained by a more P2P architecture, I think a
> couple of things (important things) are lost by going this route:
>
> 1)  Simplicity of service provision

Not sure about this. It seems to depend on whether it's easier for a service
provider to do their own assessment of the matching of a request to their
services, or to require them to describe themselves in such a way that
a third party can do the brokering. One could imagine a boilerplate cgi
script that would (for example) allow the person to fill in a mapping from
moby data types to the relevant services.

An alternative concern would be not having a standardized approach for
doing the matching; not allowing the client any control over how this
matching was being done...

> 2)  Semi/fully automated workflow discovery (finding a path from an
> input to an output data type through multiple services)
>
> The latter can probably be accomplished using the brokering approach you
> describe, but it isn't as straightforward and (as best I can imagine in
> my current state) would have to be accomplished by possibly endless
> trial-and-error traversals of many dynamically discovered service paths.
>
> The former point however, the one of simplicity, might be more important
> at the end of the day as this will affect the acceptance/adoption of the
> system by the people whom we need to make the whole thing work...
>

Agreed.

>
> > One thing I would like to point out, however, is
> >that the different approaches to service representation/service discovery
> >are not necessarily mutually exclusive. For example, I have often found it
> >useful in ISYS to define a "static service", but to allow that same service
> >to be provided dynamically, by simply writing a little code that does a
> >reasonable translation from the "self-descriptive" representation of the
> >data to the representation prescribed by my own implementation. I'm beginning
> >to wonder if a more "self-descriptive" and finer-grained approach to
> >service-typing than the "fat interfaces with signatures" model might be useful
> >to bridge these alternative approaches (possibly similar to WSDL, but I need
> >to look at that more);
> >
> I'd like to pursue this idea further, as I'm not sure I am understanding
> what you suggest as the "middle ground" here.  Please expand on this...

It's not very clear in my mind, either, but I'll give it a shot. It seems
as though the "static/dynamic" extremes could be characterized by:

"dynamic service": totally opaque/encapsulated, you give it the data and it
"decides" how to map it onto its internal model, but you have no control over
this and no real way (other than the human-readable description, or assessing
the nature of the data it returns) of understanding what the service "is";
the implementor of the service has made no real "contractual obligation"
in terms of the interface, so they can change its behavior to their
heart's content without "breaking" anything (although, presumably, certain
behaviors would be more useful to consumers than others!)

"static service": fully specified (at least in terms of structure) by an
"API-like" declaration; semantics are presumably somehow defined by a
common understanding of the service name and its parameters, possibly also
by a place in a hierarchy; you have to understand this signature enough to map
your data onto it, which gives you total responsibility and full control
over how you use it; the structural spec/semantic understanding should
be viewed as something of a "contract", not to be altered without
mutual agreement.


The "middle ground" that I dimly see here is that if services "describe"
themselves, not in a "monolithic" signature-esque way, but more in terms
of individually understandable "pieces", then service requesters could
"fill in" the pieces that they understood, and leave the rest to
the implementor of the service.  There would be nothing to prevent such
a service from providing a signature-like access, for a more traditional,
"rigorous", compiler-oriented approach, but you'd also be opening up
an interesting middle ground whereby dependencies would be ameliorated,
and different levels of "understanding" could be accommodated.
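
One possible reading of this "middle ground", sketched in Java under stated assumptions: a service advertises its request as a set of named pieces, each with a fallback the provider applies on its own, and the requester fills in only the pieces it understands. The class `PiecewiseRequest`, the method `resolve`, and the representation of pieces as string maps are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PiecewiseRequest {
    /**
     * @param advertised piece name -> the provider's own fallback for that piece
     * @param known      piece name -> value the requester is able to supply
     * @return one value per advertised piece: the requester's value if it has
     *         one, otherwise the provider's fallback
     */
    static Map<String, String> resolve(Map<String, String> advertised,
                                       Map<String, String> known) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> piece : advertised.entrySet()) {
            String supplied = known.get(piece.getKey());
            resolved.put(piece.getKey(), supplied != null ? supplied : piece.getValue());
        }
        // Anything the requester knows that the service never advertised is not sent.
        return resolved;
    }
}
```

A requester that understands every piece gets the full control of the signature-style approach, while one that understands none of them falls back to the fully "dynamic" behaviour, with the provider deciding everything.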

In fact, you could even "open up" the process of fully specifying a service
request to include other actors besides the requester and the provider. For
example, I could dimly imagine a scenario in which a "dumb client" had
a set of sequences with species identifiers and wanted to do gene finding.
A particular gene finding service might be parameterizable at the
level of kingdoms, but did not know how to map species to this level
of taxonomic characterization. Supposing that one or the other parties
in this interaction knew about another service that would take "species"
and translate this into "kingdoms", a more intelligent service request could
be created through the mediation of this component. I know, it's all rather
vague, but hopefully gives some idea for directions that could be
left open using this approach...
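
The species-to-kingdom scenario might look something like the following Java sketch. It is only an illustration: the `Translator` and `RequestMediator` classes, the `complete` method, and the piece names are hypothetical, and a real mediator would have to discover translators rather than receive them in a list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** A third-party service that can derive one piece of a request from another. */
final class Translator {
    final String from;
    final String to;
    final Function<String, String> translate;

    Translator(String from, String to, Function<String, String> translate) {
        this.from = from;
        this.to = to;
        this.translate = translate;
    }
}

final class RequestMediator {
    /** Fills in required pieces the requester lacks, using any translator that applies. */
    static Map<String, String> complete(Set<String> required,
                                        Map<String, String> known,
                                        List<Translator> translators) {
        Map<String, String> request = new LinkedHashMap<>();
        for (String piece : required) {
            if (known.containsKey(piece)) {
                request.put(piece, known.get(piece));
                continue;
            }
            for (Translator t : translators) {
                if (t.to.equals(piece) && known.containsKey(t.from)) {
                    String derived = t.translate.apply(known.get(t.from));
                    if (derived != null) {
                        request.put(piece, derived);
                        break;
                    }
                }
            }
        }
        // Pieces nobody could supply are left out for the provider to handle.
        return request;
    }
}
```

With a translator from "species" to "kingdom" available, a client that only knows its sequences and their species can still produce a request that carries the kingdom parameter the gene finder wants.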


Does that help to clarify, or just muddy the waters some more?


>
> >I
> >think it is worth thinking about different models of self-documentation that
> >might be more flexible and useful in a decentralized environment than those
> >provided by interface signature definition.
> >
> Okay  myGrid folks!!  Jump in here!!!  :-)
>
> >  They may be related to one another via inheritance (e.g.
> >SequenceText extends "LinearObject", which is merely an abstraction of a
> >thing that has "length");
> >
> I have been thinking about this type of relationship as well... I think
> we could do with having a third ontology  looking after "representable
> as" definitions (I wouldn't be thrilled to put them in the same ontology
> as the data types themselves).  In this way, client programs could be
> designed in a more generic way.
>
> e.g. there could be a viewer for a "LinearObject" datatype, where
> "Sequence" data, and "BlastHit" data are both viewable as LinearObjects.
>
> ...this is something that has been kicking around in the back of my head
> and isn't well thought out yet since it is far from being the most
> critical issue at the moment... so if that makes no sense please ignore
> it :-)

Umm, it doesn't really make sense. I got the impression from the
moby_classes document that you were already following this idea? That is
to say, I interpreted the "inheritance" of Sequence from VirtualSequence
to imply that any Sequence would also state its length?

>
> >At the next level up from this is the somewhat infamous (around here, anyway)
> >IsysObject, which simply wraps an arbitrary set of IsysAttributes to indicate
> >a level of "objective coherency", and provides querying mechanisms for
> >getting at attribute content of interest. I should point out that what we
> >were basically trying to achieve here was a somewhat more flexible way of
> >representing multiple inheritance of IsysAttribute interfaces than requiring
> >developers to statically define a class that would implement a specific set
> >of these IsysAttribute types, so in some sense, it's not really all that
> >much weirder than the notion of multiple interface inheritance.
> >
> right... we allow for (but have never yet used) an object that is
> constructed in this way... although we still require that the object's
> schema be registered as a new datatype, even if it is a composite
> object!  So we are a bit tighter here...
>
> >("Premature optimization is the root of all evil"- Knuth)
> >
> love it!
>
> >The main questions I think we need to explore
> >have to do with the nature of the "data type ontology" that will be serving
> >as a common language. My own feeling is that it is much more
> >important (and more feasible) to develop a fine-grained vocabulary along the
> >lines of the IsysAttribute catalog than to try to get agreement on the correct
> >structure of higher-level objects.
> >
> You haven't yet convinced me of this, but I need to spend more time
> reading about your IsysAttributes :-)  Arbitrary collections of
> "thingys", even if we have a fine-grained list of valid "thingys", seem
> to be much less robust and prone to misinterpretation than a properly
> defined object...  and since our objects are minimalist as it is, this
> does not seem less feasible than the approach you are suggesting...  In
> addition, we gain some assurance about the composition of an object in a
> backwards-compatible way by examining its parentage.
>
> >This seems to be somewhat at odds with the
> >picture that is presented in the moby_classes.txt file (available from the
> >biomoby cvs tree);
> >
> I should have removed that from the CVS - please ignore it as it does
> not contain any "valid" data.
>
> > I should do something similar
> >for the IsysAttribute hierarchy as it currently stands, but I want to clean it
> >up a bit, as I think it contains some needless complexity and certainly many
> >artifacts). I also think we should give consideration to the
> >"dynamic/compositional" style of growing more complex data types
> >(which is more natural to XML than it is to Class definitions).
> >
> errrm....  I understood our object models to be exactly that - dynamic
> in their composition... so long as new complex data types are
> registered...  Am I misunderstanding you?

I'm probably misunderstanding you as much as you're misunderstanding me.
I'll try to make myself clearer at the end of this already lengthy response.

>
> >At any rate, I've already violated my intention not to overwhelm
> >your attention span. Please let me know if you have any questions, comments,
> >etc. Hope it helps...
> >
> It is now 8:15!!   I managed to spend 5 1/4 hours writing this response
> (interrupted by cat and wife, both wondering why I was still on the
> computer at this time of night), and I am now well and truly knackered!
>  But it was great that you went into so much detail as it forced me to
> go out and do some additional reading in order to understand why you
> designed ISYS the way you did,  and why you espouse (or at least, keep
> an open mind about) this alternate approach to service description and
> discovery.  I'm certainly open minded about many of the ideas you bring
> up, though not entirely sold on them yet ;-)
>
> It's time we had another MOBY DIC meeting!   Emma Lake will be frozen
> over soon, so we should probably think about meeting elsewhere.  I
> should soon have a limited travel budget as our MOBY funding will start
> to flow in a few weeks.  Lukas, are you still interested in hosting the
> meeting at Carnegie?  (no pressure!  I'm just asking because you brought
> it up as a possibility last time we met...)
>
> In any case, we can continue to discuss these things online for the
> moment.  I'd particularly like to get the input of the myGrid people, as
> some of the issues raised here seem to fall right into their lap w.r.t.
> plans and architecture.
>
> It's nice to be discussing MOBY in the absence of underlying
> implementation issues such as UDDI/SOAP - I think a discussion of the
> behaviour of the system is badly needed right now before we get
> ourselves locked into something we might regret later...
>
> I'm going to bed!
>
> good night all,
>
> M
>
>
> From markw at illuminae.com Mon Sep 30 09:41:20 2002
> Date: Sun, 29 Sep 2002 07:23:04 -0500
> From: Mark Wilkinson <markw at illuminae.com>
> To: Andrew D. Farmer <adf at ncgr.org>
> Cc: moby-l at biomoby.org
> Subject: Re: [MOBY-l] MOBY at NCGR/CSHL- intro to ISYS and its conceptual
>     relationship to MOBY
>
> next morning...
>
> >This seems to be somewhat at odds with the
> >picture that is presented in the moby_classes.txt file
> >
> Sorry, my mistake - this file *does* belong in the CVS, as it is Lukas'
> first draft of a set of basic objects and their relationships.  I got
> this confused with a file I have locally which has (in a single
> monolithic file) the draft versions of the XSD definitions of all object
> classes I have attempted so far.
>
> I guess we were on a slippery slope, in that we are trying to guess
> what is the most fundamental piece of information to be included in an
> object (e.g. Sequence objects carry Sequence, and Citation objects carry
> Author/Publication information) without getting into the MAGE-ML scale
> of modelling where you want everything that is known about a piece of
> data to be included in the model.  It's a tricky game to play, I agree,
> though I don't think it is yet proven if we have succeeded or failed, as
> we don't have enough services nor use cases to make any conclusions.  I
> think we *can* make some pretty accurate guesses about what the "basic"
> information requested will be...  To go along the other path strikes me
> as dangerous, actually.  If we don't structure the data in some
> predictable way, but rather 'bag it' (as I understand you are proposing
> - correct me if I am misunderstanding), then we are quite literally
> forced to go the route you proposed where the service must examine the
> content of the 'bag' to see if it can do anything with it.  This worries
> me...  I caught the tail end of a meeting of the I3C a couple of weeks
> ago and walked in on a conversation about exactly these kinds of issues
> - the advocate was arguing quite strongly and convincingly that
> structured data is the only way that the problem is going to be solved,
> otherwise we end up with a similar (but less severe) problem to what we
> have now with scraping CGI pages to get at only the pieces that we want,
> though granted ours will be at least somewhat self descriptive...
>
> I guess I need you to give me an example of  what you are proposing,
> since I might be fanning my flies without having a clear understanding
> of what you are saying.


OK, I'll admit that this is one of the most contentious and more arcane
aspects of the "ISYS way". You're certainly not the first person to
raise their eyebrows at the proposition. And I would not claim that the
implementation used in ISYS of the underlying concepts is necessarily the
best one; in fact I hope that we will find some better ways of doing things.
But, on the other hand, I do think there are some valuable insights embodied
here, and want to make sure we are at least on the same page as far as
the basic issues are concerned.

So, one way of looking at the problem is to ask who best defines
"what is the most fundamental piece of information", the provider of the
information, or the consumer of the information? From my point of view,
it is the consumer of the information. If information isn't in a form
that can be used, what good is providing it?

For example, most sequence analysis services don't fundamentally care
whether or not a sequence has an identifier, they just need a String that
represents a sequence- some clients might generate their own sequences to
test certain properties of the analysis algorithm; whether or not they need
to generate their own namespace for these sequence strings probably depends
on whether or not the service accepts "batch requests" and whether or not
they need to associate properties of the output with properties of the
input. This is meant to illustrate the notion of "interface segregation",
i.e. even if every "retrieval service" is inclined to associate ids with
sequences, by thinking about it from the analysis service's point of view,
one is led to conclude that "having a sequence" is separable from
"having an id".  On the other hand, anything that "has a sequence" presumably
must also "have a length", so it seems fair to couple these concepts through
inheritance, as we both have done in our separate models; the "test of
independent invention" at work!! I have deliberately singled out the
"id" here, since you've put it at the root of the whole moby_class hierarchy.
While there may be other reasons for insisting on an id for all objects
in the system, in terms of its use as a "lookup", it really seems only relevant
for services that want to do a retrieval of information using that key.
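
As a rough Java sketch of the segregation described above: the analysis
method below asks only for something that "has a sequence" (and therefore
a length), so a client-generated test sequence with no id is acceptable.
The method names (getSequence, getLength, getId, gcFraction) are assumptions
made up for illustration; they are not taken from ISYS or MOBY.

```java
class SegregationDemo {
    // "Having a sequence" implies "having a length", so the two are coupled here.
    interface Sequence {
        String getSequence();
        default int getLength() { return getSequence().length(); }
    }

    // "Having an id" is a separate unit; the analysis below never asks for it.
    interface UniqueIdentifier {
        String getId();
    }

    // Hypothetical analysis service: fraction of G/C characters in the sequence.
    static double gcFraction(Sequence s) {
        String text = s.getSequence().toUpperCase();
        if (text.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (char c : text.toCharArray()) {
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / s.getLength();
    }

    public static void main(String[] args) {
        // A synthetic test sequence with no identifier at all.
        Sequence synthetic = () -> "GGCCAATT";
        System.out.println(gcFraction(synthetic)); // 0.5
    }
}
```

A retrieval service that does need a key could require UniqueIdentifier
instead, without forcing that requirement on every analysis service.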

Now, there will surely be consumers of data that do not operate
at the "lowest common denominator" level. For example, some gene finding
algorithms would very much like to have not only the sequence text
itself, but also some indication of the taxonomy of the sequence. Assuming
that we have already decided that "sequence" and "species" (or some other
taxonomic identifier) are independently usable "units" of information by some
theoretical consumer, how should we go about combining these two concepts for
consumption by the gene finding service? One approach would be to use the
"multiple inheritance" notion, e.g. define a "TaxonomicallyTaggedSequence"
construct that ensures that both units of information are present and
"coherently associated".

This is great from the point of view of the gene finder; it seems quite
reasonable to assume that if someone has gone to the trouble of implementing
this "type", they really mean "a sequence with the taxon from which it came"!

On the other hand, it's not so great from the point of view of data providers.
For one thing, it may be the case that sometimes the two attributes can be
found together in its objects, and sometimes they can't (not likely in this
example, but other attribute combinations are more compelling- "gene name" and
"ec number" is a classic example). If this is true,
it's much more convenient for a data provider to simply describe the set of
"data units" that happen to be present for a given object instance.

A much more compelling argument comes from the problem of combinatorics.
First off, to be perfectly clear, assume we have the following type
system (I'd draw it as a DAG, but my ASCII-art skills aren't very good):
	interface Sequence {}
	interface TaxonomicTag {}
	interface UniqueIdentifier {}

	interface TaxonomicallyTaggedSequence extends Sequence, TaxonomicTag {}
	interface TaxonomicallyTaggedUniquelyIdentifiedSequence
		extends Sequence, TaxonomicTag, UniqueIdentifier {}

If I have an instance of type TaxonomicallyTaggedUniquelyIdentifiedSequence,
and ask whether it is of type TaxonomicallyTaggedSequence, I will be told
"no" (at least, this is how Java works...), even though everything guaranteed
by a TaxonomicallyTaggedSequence is also guaranteed by a
TaxonomicallyTaggedUniquelyIdentifiedSequence!
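
The "no" answer can be checked with a small, self-contained Java program.
The interfaces mirror the type system listed above; the accessor methods and
the Entry class are hypothetical additions so that there is something to
instantiate.

```java
class NominalTypingDemo {
    interface Sequence { String getSequence(); }
    interface TaxonomicTag { String getTaxon(); }
    interface UniqueIdentifier { String getId(); }

    interface TaxonomicallyTaggedSequence extends Sequence, TaxonomicTag {}
    interface TaxonomicallyTaggedUniquelyIdentifiedSequence
            extends Sequence, TaxonomicTag, UniqueIdentifier {}

    // Hypothetical implementation of the three-way combination (made-up values).
    static final class Entry implements TaxonomicallyTaggedUniquelyIdentifiedSequence {
        public String getSequence() { return "ATGGCC"; }
        public String getTaxon() { return "Arabidopsis thaliana"; }
        public String getId() { return "hypothetical:0001"; }
    }

    // Asks the question by name: is the two-way combined type declared?
    static boolean isDeclaredTaggedSequence(Object o) {
        return o instanceof TaxonomicallyTaggedSequence;
    }

    // Asks the question by content: are both individual units present?
    static boolean hasBothUnits(Object o) {
        return o instanceof Sequence && o instanceof TaxonomicTag;
    }

    public static void main(String[] args) {
        Object entry = new Entry();
        System.out.println(isDeclaredTaggedSequence(entry)); // false
        System.out.println(hasBothUnits(entry));             // true
    }
}
```

Java subtyping follows declared names, so the first check fails even though
both individual units are present.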

Now, as a data provider, I have no reason to assume that "sequence" + "taxon"
is an important combination of data for any consumer. Assuming that there
is in fact a combination of "data units" that is constant in my "schema",
I would be likely to assert that this particular combination
("TaxonomicallyTaggedUniquelyIdentifiedKeywordAssociatedSequence") is
a "type". Perhaps I have been broad-minded enough to multiply inherit
this type from each relevant "data unit". Even so, what on earth would possess
me to multiply inherit this type from every possible sub-combination of
these "data units"? To me, "TaxonomicallyTaggedSequence" is a rather
arbitrary selection of two of the "data units". Thus, when I pass my
data object off to the gene finder, what should it do: refuse to
have anything to do with my data which doesn't explicitly declare itself as
"TaxonomicallyTaggedSequence"? or, "infer" from the fact that I have tagged
my data with the two relevant "data units" individually that this is
equivalent to the explicit combination of the two into a single unit?
I'm arguing for the latter decision...

Now at this point, it doesn't seem to me that we have gone too far along the
"slippery slope" of "loose bagging"; we still have well-defined objects that
are using multiple inheritance in a reasonably straightforward way. All we
have argued is that providers/consumers of data adopt a convention that
data be described and interpreted in terms of the "interface segregation
principle". Note, however, that since we can't get compilers to adopt our
convention, we lose some "type safety" in terms of compile-time checks for
consistency.
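
A minimal Java sketch of that convention and its cost, under the assumption
of a hypothetical gene finder (findGenes) and a hypothetical provider type
(ProviderRecord); none of these names come from ISYS or MOBY. The consumer
infers the combination from the individual units, but because its parameter
can only be declared as Object, a missing unit shows up as a run-time error
rather than a compile-time one.

```java
class GeneFinderDemo {
    interface Sequence { String getSequence(); }
    interface TaxonomicTag { String getTaxon(); }
    interface UniqueIdentifier { String getId(); }
    interface KeywordAssociated { String getKeyword(); }

    // The provider's own combination; it never declares "TaxonomicallyTaggedSequence".
    static final class ProviderRecord
            implements Sequence, TaxonomicTag, UniqueIdentifier, KeywordAssociated {
        public String getSequence() { return "ATGGCC"; }
        public String getTaxon() { return "Oryza sativa"; }
        public String getId() { return "hypothetical:0002"; }
        public String getKeyword() { return "example"; }
    }

    // Data carrying only a sequence, with no taxonomic tag.
    static final class BareSequence implements Sequence {
        public String getSequence() { return "ATGGCC"; }
    }

    // Hypothetical consumer: accepts anything tagged with both units individually.
    static String findGenes(Object data) {
        if (!(data instanceof Sequence) || !(data instanceof TaxonomicTag)) {
            // This is the check the compiler can no longer make for us.
            throw new IllegalArgumentException("need both a sequence and a taxonomic tag");
        }
        String sequence = ((Sequence) data).getSequence();
        String taxon = ((TaxonomicTag) data).getTaxon();
        return "scanning " + sequence.length() + " bases using a model for " + taxon;
    }

    public static void main(String[] args) {
        System.out.println(findGenes(new ProviderRecord()));
        try {
            findGenes(new BareSequence());
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```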

The IsysObject approach merely takes this one step farther, and argues that
the use of the static type system is a sort of "implementation detail".
IsysObject basically serves to "encapsulate/abstract" the notion, so that
it doesn't matter to the system whether in fact you have one object implementing
multiple interfaces, or several objects implementing one or more
interfaces. The following points may help to clarify the argument.

First, by adopting the "segregated interface" interpretation convention,
we've already abandoned the compile-time advantages to static typing,
so we're not losing anything on that front.

Second, we're not really "typing" in the classical sense of specifying
"behavior" through interfaces, it's more like defining data structures
with semantics (kind of like XML).

Further, not every language supports the same constructs, though clearly
the ideas we're striving for aren't really language-dependent; in fact,
I think that XML represents the ideas here much better.
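
To make the encapsulation idea concrete, here is a small Java sketch. It is
not the actual IsysObject API: AttributeBag, Attribute, find and describe are
hypothetical names chosen for illustration. The point it shows is that a
consumer querying by attribute type gets the same answer whether one object
implements several interfaces or several objects implement one each.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class AttributeBagDemo {
    interface Attribute {}  // hypothetical stand-in for a root attribute type
    interface Sequence extends Attribute { String getSequence(); }
    interface TaxonomicTag extends Attribute { String getTaxon(); }

    // Hypothetical wrapper around an arbitrary set of attributes.
    static final class AttributeBag {
        private final List<Attribute> attributes = new ArrayList<>();

        AttributeBag add(Attribute a) {
            attributes.add(a);
            return this;
        }

        // Returns the first wrapped attribute that satisfies the requested type.
        <T extends Attribute> Optional<T> find(Class<T> type) {
            for (Attribute a : attributes) {
                if (type.isInstance(a)) {
                    return Optional.of(type.cast(a));
                }
            }
            return Optional.empty();
        }
    }

    // One object implementing both interfaces.
    static final class TaggedSequence implements Sequence, TaxonomicTag {
        public String getSequence() { return "ATGGCC"; }
        public String getTaxon() { return "Arabidopsis thaliana"; }
    }

    // A consumer that only cares whether both units can be found.
    static String describe(AttributeBag bag) {
        Optional<Sequence> seq = bag.find(Sequence.class);
        Optional<TaxonomicTag> tax = bag.find(TaxonomicTag.class);
        if (!seq.isPresent() || !tax.isPresent()) {
            return "unusable";
        }
        return tax.get().getTaxon() + ":" + seq.get().getSequence().length();
    }

    public static void main(String[] args) {
        AttributeBag single = new AttributeBag().add(new TaggedSequence());
        AttributeBag split = new AttributeBag()
                .add((Sequence) () -> "ATGGCC")
                .add((TaxonomicTag) () -> "Arabidopsis thaliana");
        System.out.println(describe(single)); // Arabidopsis thaliana:6
        System.out.println(describe(split));  // Arabidopsis thaliana:6
    }
}
```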

So, maybe our subsequent discussions should forget about these notions of
interfaces and multiple inheritance, and focus on the XML angle. Although
it's certainly possible to define rigidly structured data representations
using XML, it seems to me to miss the point!

Well, I could go on and on in this vein, but I get the feeling that I'd do
better to engage in more focused dialog on these issues than keep writing
monographs on the subject...

I sincerely hope you don't lose any sleep over this one!

Andrew Farmer
adf at ncgr.org
(505) 995-4464
Database Administrator/Software Developer
National Center for Genome Resources




