[MOBY-l] MOBY at NCGR/CSHL- intro to ISYS and its conceptual relationship to MOBY

Fri Sep 27 17:36:26 UTC 2002

Hi all-

The NCGR/CSHL groups have recently begun receiving our NSF funding for the
MOBY project, and we met earlier this week at CSHL to develop a plan on how
we'll be proceeding. I have chatted a little bit offline with Mark Wilkinson
about coordinating our efforts, and we're both very anxious to keep the
MOBY vision from getting fragmented (as well as beginning a concerted effort
to clarify what that vision is!), while following through with our separate
commitments to individual funding sources.

Before I get too far, let me first give you a little background on where I
fit into the picture, since this is my first post to the list. I have about
80% of my time funded for MOBY, so I hope to be heavily involved in
articulating the vision and realizing it in an implementation.
I've been on the ISYS project (which helped to inspire the MOBY idea) since
its inception, and have been closely involved in it throughout, so I have a
good understanding of what we were trying to accomplish, details of its
design and implementation, and a pretty good feel for its strengths and
weaknesses. Since signing on for MOBY, I've been doing some reading into the
world of "web services" and am beginning to develop some sense for the basic
issues and efforts going on there; also, I've read through the archives of
the moby list to get some sense for what folks here have been working on,
and where it's heading.

I wanted to take a little time to present a slightly more detailed account
of some of the concepts underlying the ISYS approach to integration, and to
kick off discussions on how these concepts might relate to the MOBY project.
I don't want to overwhelm you with information or give the impression that
I want to force anyone to adopt the ISYS approach, but I do think it might
be useful to offer some insight into some of the foundations of my
perspective on certain issues...

Basic Architecture:

The ISYS Platform consists of two distinct pieces, a ServiceBroker and
an EventChannel. Both of these have to do with mediating component
interactions, so that components can interact without having any direct
knowledge of one another ("loose coupling"). Components can be divided into
two classes, ServiceProviders and Clients. The former provide specific
functionalities (services) for users of the system, such as data access,
analysis or visualization. The latter represent components that participate
in event-based communication (e.g. visual synchronization of GUIs); they
are generally GUIs, although this is not a strict requirement
(e.g. one could imagine data management components that communicated changes
to their content via events). Note that ServiceProviders and Clients are
not necessarily mutually exclusive concepts; for example, in some recent
work I did to wrap the GO DagEdit tool, a DagEdit Client when loaded with
an ontology would provide a service to search that ontology with terms
supplied by other components.

For the purposes of MOBY, I think we can pretty much ignore the
Client/EventChannel part of the picture, and concentrate on ServiceProviders
and Services. When ISYS starts, it "discovers" ServiceProviders through
a simple plug-in strategy: it scans through a Components directory in
which each component lays out its resources in a basic standard structure,
gets the information about which classes represent ServiceProviders, and
creates instances of these classes using a special classloader mechanism
(I won't bore you with the gory details of this!). This means that
ServiceProviders must be present locally in the ISYS installation, although
they may communicate with non-local resources; the distributed nature of
the MOBY service providers will have some implications in terms of how
we approach things here.
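
To make the plug-in scan concrete, here is a rough sketch in Python
(ISYS itself is Java, and this is not its actual code). It assumes, purely
for illustration, that each component directory carries a "component.json"
manifest naming its provider classes; the manifest name, its
"service_providers" key, and the function names are all hypothetical.

```python
import importlib.util
import json
import tempfile
from pathlib import Path


def discover_providers(components_dir):
    """Scan each component sub-directory and instantiate its declared providers."""
    providers = []
    for comp_dir in sorted(Path(components_dir).iterdir()):
        manifest = comp_dir / "component.json"  # hypothetical manifest file
        if not manifest.is_file():
            continue  # not a component; skip it
        info = json.loads(manifest.read_text())
        for entry in info["service_providers"]:  # hypothetical key
            module_file, class_name = entry.split(":")
            # Load the module from the component's own directory, a rough
            # stand-in for the per-component classloader mentioned above.
            spec = importlib.util.spec_from_file_location(
                f"{comp_dir.name}_{Path(module_file).stem}", comp_dir / module_file
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            providers.append(getattr(module, class_name)())
    return providers


def make_demo_components(root):
    """Create one throwaway component on disk so the scan has something to find."""
    comp = Path(root) / "seqtools"
    comp.mkdir()
    (comp / "component.json").write_text(
        json.dumps({"service_providers": ["provider.py:SeqProvider"]})
    )
    (comp / "provider.py").write_text(
        "class SeqProvider:\n"
        "    def describe(self):\n"
        "        return 'sequence retrieval'\n"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_components(root)
        for provider in discover_providers(root):
            print(type(provider).__name__, "->", provider.describe())
```

The point of the sketch is only that discovery needs nothing beyond a
directory convention; a distributed MOBY would have to replace the local
scan with some form of registration.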

ServiceProvider is an interface defined by ISYS, and implemented
by the components in whatever way is most appropriate for them; although
a component must provide Java implementation classes of the basic ISYS
classes, these may merely represent stubs that communicate with other
non-Java processes (e.g. remote servers, command line scripts, etc.). The
basic methods prescribed by the ServiceProvider interface allow ISYS to
obtain information from it about services that it provides and (in some
cases) the dependencies that it has on other services (not specific
implementations, just the abstract interfaces).

Services in ISYS:

Services and Service brokering in ISYS come in two distinct flavors:
"static" and "dynamic".

The former is the more traditional approach by which specific well-structured
interfaces are defined (e.g. a "RetrieveSequence" service with specified
input types (Identifier) and output types (SequenceText));
in this case, a component that wants to use that specific functionality
will have been designed with knowledge of the service interface, and will
simply ask ISYS to provide it with an implementation (either a user
specifiable default, or a list of all known implementations) of that
service. These service types are specified as interfaces (along with
specifications of datatype interfaces for their input and output types)
in a special package that acts as a kind of catalog of static
services; in this way, it is somewhat analogous to a certain way of looking
at the notion of a service "ontology" at a central MOBY registry (although
it's not the only way of looking at how the latter ontology might be
used...). Technically, there is no reason why these interface definitions
need to be in the same "package space"; it was simply our way of defining
a central place where developers could check to see if the sort of service
they wanted to use or provide had already been specified.
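
As a rough illustration of the static model (in Python rather than ISYS's
Java, and not the real ISYS API), a broker keyed by interface might look
like the following. "RetrieveSequence" and "ServiceBroker" are names from
the description above; the method names and the two toy implementations
are assumptions made for the sketch.

```python
from abc import ABC, abstractmethod


class RetrieveSequence(ABC):
    """Static service interface: Identifier in, SequenceText out."""

    @abstractmethod
    def retrieve(self, identifier: str) -> str:
        ...


class LocalCacheRetriever(RetrieveSequence):
    def retrieve(self, identifier):
        return {"demo:1": "ATGGCC"}.get(identifier, "")


class RemoteStubRetriever(RetrieveSequence):
    def retrieve(self, identifier):
        # A real stub would call out to a server; this one fakes a result.
        return "NNNN"


class ServiceBroker:
    def __init__(self):
        self._implementations = {}  # interface -> list of implementations
        self._defaults = {}         # interface -> user-chosen default

    def register(self, interface, implementation):
        if not isinstance(implementation, interface):
            raise TypeError("implementation does not satisfy the interface")
        self._implementations.setdefault(interface, []).append(implementation)

    def set_default(self, interface, implementation):
        self._defaults[interface] = implementation

    def get_all(self, interface):
        return list(self._implementations.get(interface, []))

    def get_default(self, interface):
        # Fall back to the first registered implementation if none was chosen.
        if interface in self._defaults:
            return self._defaults[interface]
        candidates = self.get_all(interface)
        return candidates[0] if candidates else None


broker = ServiceBroker()
cache, remote = LocalCacheRetriever(), RemoteStubRetriever()
broker.register(RetrieveSequence, cache)
broker.register(RetrieveSequence, remote)
broker.set_default(RetrieveSequence, remote)

# The caller knows only the interface, never the implementing class.
service = broker.get_default(RetrieveSequence)
print(service.retrieve("demo:1"))
```

The caller is written against the interface alone, and the choice of
implementation stays with the user or the broker.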

We haven't actually used this mechanism of service specification
that heavily in ISYS, having found the "dynamic service" paradigm rather
more powerful in the context of the ISYS client-orientation. So, for example,
the "ontology" that is specified by these services is really not at all
a deep hierarchy- it is really only a set of interfaces for which various
components may provide implementations. There is no conceptual reason
why it couldn't be a deeper hierarchy; we just never found it to be useful
to abstract things out to higher levels.

Dynamic services differ from static services in several important ways.
First, the interface for dynamic services is very generic, and totally
encapsulates the "semantics" of the service. There is a distinction between
"DynamicDataService" and "DynamicViewerService"- the former returns data,
the latter provides a visualization (i.e. a "Client"), but other than that,
dynamic services are almost totally opaque to the system in terms of
what exactly they are doing (although they are required to provide a
descriptive String so that the user gets a sense for what it is he/she is
invoking).

Second, the inputs and outputs of a dynamic service are similarly
opaque to the system; the notion of an ISYS data set will be discussed further
below, but for now it can be understood as a self-describing data structure,
similar in concept to an XML document, although implemented in terms of
memory-based data structures.

Third, the mechanism for "discovering" dynamic services is quite different
from the static service model. The latter assumes that the designer of the
component envisions the need for a service, and can ask for it in terms of an
interface that is suited to their preconceived idea of the utility of the
function. This is somewhat similar in concept to the notion of programming to
the API of a library, although instead of "hard-coding" the relationship to a
specific implementation, the implementation is dynamically associated with
the caller (imagine using a dynamically loaded library, but allowing the user
some control over the linking process). With dynamic services, on the other
hand, the designer of the component merely supplies a mechanism for
translating some portion of its internal data set (generally, the "selection
set") into the self-descriptive IsysObjectCollection data representation,
and a way for the user to trigger the "dynamic discovery" process (generally,
a right-click in the UI). When this occurs, ISYS simply passes around
references to the data set to each of the registered ServiceProviders, and
asks them to inspect the data and return the set of dynamic services that they
provide that could be used on the dataset. This inspection can be as simple
as looking for data of a certain type (e.g. identifiers in a certain namespace)
or more complicated (e.g. looking at the lengths of the sequences provided, or
the value of a species attribute). The main point is that the "service matching"
is totally encapsulated in the ServiceProvider, and does not depend on some
third party "matchmaker" like UDDI. Of course, the fact that we're only passing
around references to objects in memory makes this much easier than doing a
similar trick on the network, but one can imagine an analogous mechanism for
a MOBY-like system; for example, if a MOBY client simply sent out some simple
representation of what data types were present in its input set, that would
probably be sufficient for most providers to do a reasonable job of presenting
their relevant services. (Note that there may be some fuzzy ground here between
the notions of a "type" and a "value"; for example, if one uses the LSID
structure (as I understand it), the "namespace" is a property of the "value"
instead of the "type" of that data, but would probably be critical in matching
retrieval services to the data; another example would be a "sequence", for
which the "alphabet" used by the sequence could be encoded into a subtype or
simply viewed as a property of the sequence "value".)
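
The discovery loop described above can be sketched as follows. This is a
Python illustration under stated assumptions, not ISYS code: the data set
is reduced to a list of plain dictionaries standing in for an
IsysObjectCollection, and the provider classes, method names, and the
20-residue cutoff are all invented for the example.

```python
def list_genbank_ids(dataset):
    return [o["id"] for o in dataset if o.get("id", "").startswith("genbank:")]


def count_short_sequences(dataset, max_length=20):
    return sum(1 for o in dataset if 0 < len(o.get("sequence", "")) < max_length)


class DynamicService:
    """Opaque to the system apart from a human-readable description."""

    def __init__(self, description, run):
        self.description = description
        self.run = run  # callable that takes the data set


class IdLookupProvider:
    """Matches on a namespace, which lives in the value rather than the type."""

    def get_dynamic_services(self, dataset):
        if list_genbank_ids(dataset):
            return [DynamicService("Retrieve GenBank records", list_genbank_ids)]
        return []


class ShortSequenceProvider:
    """Matches on a property of the data: only short sequences qualify."""

    def get_dynamic_services(self, dataset):
        if count_short_sequences(dataset) > 0:
            return [DynamicService("Design primers", count_short_sequences)]
        return []


def discover(providers, dataset):
    """Pass the same data set to every provider and pool what they offer."""
    offered = []
    for provider in providers:
        offered.extend(provider.get_dynamic_services(dataset))
    return offered


providers = [IdLookupProvider(), ShortSequenceProvider()]
selection = [{"id": "genbank:ABC123", "species": "E. coli"}, {"name": "lacZ"}]

for service in discover(providers, selection):
    print(service.description)
```

Note that the broker never interprets the data; all matching logic sits
inside the providers, which is the encapsulation being argued for.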

At any rate, the main point I want to make is that this architecture
represents a kind of extreme of encapsulation, and is rather different in
spirit from the notions most people seem to have of MOBY-central (or UDDI in
the standard web services models) as being necessary as a matchmaker between
service requests and service descriptions. Though I certainly agree that
having services that are more self-descriptive will be valuable, I don't think
we should rule out the possibility of exploring alternative approaches to
the service brokering. For example, one could imagine "MOBY-Central" as
being nothing more than a registry of distributed "ServiceProviders"
(and probably the registry of the "ontologies/vocabularies" of data types
and service types or service descriptors). An entry in the ServiceProvider
registry could be something as simple as a URL to which one could POST
the representation of the datatype information, and which would return
a document with hyperlinks to URLs, some of which might simply be relevant
"GET" retrievals (though probably not unless id values were included in the
POST), others to which you could post the dataset for "execution" of the
service. (Note that I'm deliberately trying to paint this picture without
using SOAP/WSDL for the time being, although one could imagine using those
as well....)
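
A very rough sketch of that exchange, with the network left out: the
client reduces its data set to a list of types, and a provider's discovery
URL answers with links. Everything here is hypothetical, including the
JSON message shape (chosen only for brevity), the field names, and the
example.org URLs; nothing in it reflects an agreed MOBY format.

```python
import json


def summarize_types(dataset):
    """Reduce a data set to the sorted set of attribute types it contains."""
    return sorted({key for obj in dataset for key in obj})


def handle_discovery_post(body):
    """What a provider's discovery URL might do with a POSTed type summary."""
    types = set(json.loads(body)["types"])
    links = []
    if "sequence" in types:
        links.append({"method": "POST",
                      "href": "http://provider.example.org/blast",
                      "description": "Run a similarity search"})
    if "id" in types:
        links.append({"method": "POST",
                      "href": "http://provider.example.org/annotations",
                      "description": "Fetch annotations for these ids"})
    return json.dumps({"services": links})


# The central registry is then little more than a list of discovery URLs.
registry = ["http://provider.example.org/discover"]

selection = [{"id": "demo:1", "sequence": "ATGGCC"}, {"name": "lacZ"}]
request_body = json.dumps({"types": summarize_types(selection)})
response = json.loads(handle_discovery_post(request_body))
for link in response["services"]:
    print(link["method"], link["href"], "-", link["description"])
```

Because only types are sent, every returned link is a POST target; plain
GET links would need the id values themselves in the request.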

We can explore the pros and cons of the various approaches in subsequent
discussions, but I hope this helps get people thinking in different
ways about the problem. One thing I would like to point out, however, is
that the different approaches to service representation/service discovery
are not necessarily mutually exclusive. For example, I have often found it
useful in ISYS to define a "static service", but to allow that same service
to be provided dynamically, by simply writing a little code that does a
reasonable translation from the "self-descriptive" representation of the
data to the representation prescribed by my own implementation. I'm beginning
to wonder if a more "self-descriptive" and finer-grained approach to
service-typing than the "fat interfaces with signatures" model might be useful
to bridge these alternative approaches (possibly similar to WSDL, but I need
to look at that more); for example, loosely constrained self-description of
services could be used to build up alternative hierarchies of classification,
depending on the "axes" of interest in a particular context (e.g. input vs
output). API-like interfaces seem to be easier for programmers to think about,
but many of their advantages in terms of compile-time type checking and
optimization don't seem to be particularly relevant in the context of a
distributed services model. I have the feeling that their main function
in this context is one of documentation/description,
and, to some extent, "contractual accountability", although
the latter is more of a convention than anything strictly enforceable. I
think it is worth thinking about different models of self-documentation that
might be more flexible and useful in a decentralized environment than those
provided by interface signature definition.
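
The "static service offered dynamically" idea mentioned above amounts to
a small adapter. The following Python sketch shows one way it could look;
the class and method names are hypothetical, and dictionaries again stand
in for the self-descriptive data representation.

```python
class RetrieveSequence:
    """Static service: callers know this signature in advance."""

    def retrieve(self, identifier):
        return {"demo:1": "ATGGCC", "demo:2": "TTGACA"}.get(identifier)


class DynamicRetrieveSequence:
    """Wraps the static service so it can also be discovered dynamically."""

    description = "Retrieve sequences for selected identifiers"

    def __init__(self, static_service):
        self._static = static_service

    def applies_to(self, dataset):
        return any("id" in obj for obj in dataset)

    def execute(self, dataset):
        # Translate self-descriptive objects into the static call and back.
        results = []
        for obj in dataset:
            if "id" not in obj:
                continue
            sequence = self._static.retrieve(obj["id"])
            if sequence is not None:
                results.append({"id": obj["id"], "sequence": sequence})
        return results


adapter = DynamicRetrieveSequence(RetrieveSequence())
selection = [{"id": "demo:1"}, {"name": "lacZ"}, {"id": "missing:9"}]
if adapter.applies_to(selection):
    print(adapter.execute(selection))
```

The translation code is the only part that knows both representations,
so the static implementation itself needs no changes.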

ISYS "Data Model":

The last thing I wanted to discuss was the approach to data modeling that
we have been using in ISYS. The best discussion of this can be found in
the Isys Developer's Manual, available as part of the ISYS SDK, at:
http://www.ncgr.org/isys/sdk-regist.html
Much of this document goes fairly deep into implementation issues for
ISYS developers, but the first sections are reasonably high-level
(the data model discussion is section 1.1.3). I won't try to reproduce the
arguments here (though I strongly recommend reading that section), but will
sketch out the basic approach.

The fundamental construct in the Isys data model is the "IsysAttribute",
which is really just the root of a dynamically growing "ontology" of
data-oriented interfaces (as in the case of services, we simply used a
package construct to serve as the unifying "registry" of "contributed"
attribute interfaces). These interfaces tend to be specified in terms
of the smallest unit of information that is potentially meaningful in
isolation, e.g. name, id, sequence, description.  They specify not only
"structure" (a String, a pair of Numbers, etc.) but also "semantics" (both
Name and SequenceText are Strings, but are obviously used in radically
different ways). They may be related to one another via inheritance (e.g.
SequenceText extends "LinearObject", which is merely an abstraction of a
thing that has "length"); some of these inheritance relationships specify
refinements in structure (a SequenceText tells you both its length, and the
String representation of the sequence), while others are more like semantic
refinements (some Names are "GeneNames", but both are just Strings). There
is no limitation on the complexity of the interfaces, but we have found it
useful to adhere to the "interface segregation principle", and include in
each interface only enough information for it to be a self-coherent whole
(e.g. a start/end position on a sequence is specified by one interface
instead of two). It is also important to note that some attributes are
more like semantic markup tags (e.g. Name just "wraps" a String with
a semantic indication of its significance); others are more like semantically
meaningful indications of relationships between objects (e.g. a parent-child
relationship or the positional specification of one object like a
sequence feature onto a map).
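
A small sketch of such a hierarchy, using Python abstract base classes in
place of Java interfaces. The interface names Name, GeneName, LinearObject
and SequenceText come from the description above; "Location", the method
names, and the two concrete classes are assumptions for illustration and
are not taken from the actual ISYS catalog.

```python
from abc import ABC, abstractmethod


class IsysAttribute(ABC):
    """Root of the attribute hierarchy; carries no data itself."""


class Name(IsysAttribute):
    @abstractmethod
    def get_name(self) -> str: ...


class GeneName(Name):
    """Semantic refinement: same structure as Name, narrower meaning."""


class LinearObject(IsysAttribute):
    @abstractmethod
    def get_length(self) -> int: ...


class SequenceText(LinearObject):
    """Structural refinement: adds the sequence string to the length."""

    @abstractmethod
    def get_sequence(self) -> str: ...


class Location(IsysAttribute):
    """Start and end belong together, so they share one interface."""

    @abstractmethod
    def get_start(self) -> int: ...

    @abstractmethod
    def get_end(self) -> int: ...


class SimpleGeneName(GeneName):
    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name


class SimpleSequence(SequenceText):
    def __init__(self, text):
        self._text = text

    def get_sequence(self):
        return self._text

    def get_length(self):
        return len(self._text)


gene = SimpleGeneName("lacZ")
seq = SimpleSequence("ATGGCC")
# A consumer that only understands Name or LinearObject can still use these.
print(isinstance(gene, Name), isinstance(seq, LinearObject), seq.get_length())
```

A consumer written against the coarser interface keeps working when a
provider supplies the finer one, which is what makes the vocabulary able
to grow.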

At the next level up from this is the somewhat infamous (around here, anyway)
IsysObject, which simply wraps an arbitrary set of IsysAttributes to indicate
a level of "objective coherency", and provides querying mechanisms for
getting at attribute content of interest. I should point out that what we
were basically trying to achieve here was a somewhat more flexible way of
representing multiple inheritance of IsysAttribute interfaces than requiring
developers to statically define a class that would implement a specific set
of these IsysAttribute types, so in some sense, it's not really all that
much weirder than the notion of multiple interface inheritance. In fact, an
application that had already defined classes with definite data structures
could often just be used "as is", simply altering it to explicitly identify
which of the basic IsysAttribute types it was implementing. Understood
from this perspective, IsysObject is merely using "composition" rather than
inheritance to effect the same sort of thing in a more dynamic and flexible
way. This basic theme of
	dynamic:static::interpreted:compiled::flexible:efficient...
is really fundamental to software/systems design, and deciding the appropriate
level for your application is critical; needless to say, in ISYS we have
tended to the view that bioinformatics needs flexibility and dynamism, and
efficiency is somewhat secondary at this stage in the game.
("Premature optimization is the root of all evil"- Knuth)

Finally, there is the IsysObjectCollection, which is the unit of information
exchange between Isys components. It merely indicates a set of IsysObjects
without any necessary inter-relationship (usually, it's just a set of
things that the user has selected for reasons known only to her), and provides
similar "query facilities" to make navigation of the data content easier for
the consumer.
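
To illustrate the composition idea, here is a minimal Python sketch of an
object that wraps attributes and a collection that wraps objects, each
with a simple type-based query. The query method names and the toy
attribute classes are hypothetical; the real ISYS classes are Java and
richer than this.

```python
class Name:
    def __init__(self, value):
        self.value = value


class GeneName(Name):
    pass


class SequenceText:
    def __init__(self, text):
        self.text = text


class IsysObject:
    """Groups arbitrary attributes that describe one coherent thing."""

    def __init__(self, *attributes):
        self._attributes = list(attributes)

    def get_attributes(self, attribute_type):
        # isinstance means a query for Name also finds GeneName.
        return [a for a in self._attributes if isinstance(a, attribute_type)]

    def has(self, attribute_type):
        return bool(self.get_attributes(attribute_type))


class IsysObjectCollection:
    """A set of objects with no implied relationship, e.g. a user selection."""

    def __init__(self, objects):
        self._objects = list(objects)

    def objects_with(self, attribute_type):
        return [o for o in self._objects if o.has(attribute_type)]

    def all_attributes(self, attribute_type):
        return [a for o in self._objects for a in o.get_attributes(attribute_type)]


gene = IsysObject(GeneName("lacZ"), SequenceText("ATGGCC"))
note = IsysObject(Name("untitled"))
selection = IsysObjectCollection([gene, note])

print([n.value for n in selection.all_attributes(Name)])
print(len(selection.objects_with(SequenceText)))
```

No class was ever declared that is "a GeneName plus a SequenceText"; the
combination exists only in the instance, which is the flexibility being
traded against static typing.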

In terms of the MOBY project, it's reasonably clear that XML representations
are the way to go for representing data (although I think Lincoln has pointed
out in various discussions that tab-delimited structures may be just as
appropriate in cases where the data is flat and reasonably homogeneous in
content, as long as "self-descriptiveness" can be done in a standard way
at the level of the columns). The main questions I think we need to explore
have to do with the nature of the "data type ontology" that will be serving
as a common language. My own feeling is that it is much more
important (and more feasible) to develop a fine-grained vocabulary along the
lines of the IsysAttribute catalog than to try to get agreement on the correct
structure of higher-level objects. This seems to be somewhat at odds with the
picture that is presented in the moby_classes.txt file (available from the
biomoby cvs tree). (I should do something similar
for the IsysAttribute hierarchy as it currently stands, but I want to clean it
up a bit, as I think it contains some needless complexity and certainly many
artifacts.) I also think we should give consideration to the
"dynamic/compositional" style of growing more complex data types
(which is more natural to XML than it is to Class definitions).

...well, enough of my yakking! If you want to try ISYS, I would suggest getting
the version that is available for download from:
http://www.ncgr.org/pathdb/download.html

(This is a newer version of the system that includes some newer components,
as well as some platform enhancements; it is also easier to install than the
old distribution; we haven't updated the ISYS part of the website to reflect
these changes...)

There are links to some publications that discuss ISYS in relation to other
approaches to integration at:
http://www.ncgr.org/isys/architecture.html

I should note that these are not terribly up-to-date with respect to certain
aspects of the design of the system (in particular, our approach to data
modeling), but they should give a fairly good sense for what we were trying to
accomplish (as well as what we were not trying to solve)...

At any rate, I've already violated my intention not to overwhelm
your attention span. Please let me know if you have any questions, comments,
etc. Hope it helps...

Andrew Farmer
adf at ncgr.org
(505) 995-4464
Database Administrator/Software Developer
National Center for Genome Resources