[DAS] discussion document for das/2

Thomas Down td2@sanger.ac.uk
Thu, 6 Dec 2001 22:21:09 +0000


On Thu, Dec 06, 2001 at 03:42:42PM -0500, Lincoln Stein wrote:
>
> > This search by closest parent(s) is potentially /very/ expensive, and
> > negates the use of ontology terms. For example, we could define an
> > ad-hoc parent to Telomere & Clone, but the most derived node in your
> > hierarchy that this can be cast to is PositionalFeature. The server would
> > end up pushing all features over when you just want Telomeres and Clones.
> 
> H'mm.  Let's take a use case.  An annotation source wants to publish a new 
> feature called "Type A telomeric repeat", but the ontology only has feature 
> types called "Telomere" and "Repeat".  I would like the client to be able to 
> request types of either Telomere or Repeat and get the "Type A telomeric 
> repeat" (among other things).  What I am thinking is that when the annotation 
> returns the table of contents for the types it serves (the "type list" 
> service), it explains to the client that "Type A telomeric repeat" belongs 
> under the "Telomere" and "Repeat" parents.  We could support the query in 
> either of two ways:
> 
> 	1)  the client lists the annotation types it wants to receive.   Since
> 		the client knows where the new types fit in under the
> 		hierarchy, it can ask for all the nodes under Telomere
> 		explicitly
> 
> 	2) the client asks for the more general node, such as Telomere, and
> 		the server, knowing its local ontology expands that to
> 		the list of specific terms and fetches them
> 
> I don't see that either of these operations is incredibly expensive.  Perhaps 
> there was something confusing in the way that I wrote this section?

Yes, you should certainly be able to ask for `telomere'
or `repeat', and get back telomeric repeats.  I think Matthew's
concern (and mine) was about the following wording:

| The DAS/2 protocol must allow annotators to
| make ad hoc insertions into the ontology should the feature they wish
| to describe not match exactly with any of the preexisting ones.  These
| should be indicated at query time by providing the identities of the
| closest parent(s) in the prebuilt ontology.

This, to me, implies that you're not actually allowed to ask
straight out for `type A telomeric repeats', even if the server
has specifically informed you of their existence.  This then
leaves extra filtering to be done client side, with unnecessary
data sent across the wire.
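
To make the distinction concrete, here is a minimal sketch of the behaviour I would hope for, assuming the server keeps a child-to-parents map for its local ontology. The names ONTOLOGY_PARENTS, FEATURES, expand and query are all hypothetical and not part of any DAS specification. A query for a general term is expanded server side, and a query for the ad hoc type itself is also accepted directly:

```python
# Hypothetical local ontology: each type maps to its parent types.
# The last entry is an ad hoc insertion published by the annotation server.
ONTOLOGY_PARENTS = {
    "Telomere": ["PositionalFeature"],
    "Repeat": ["PositionalFeature"],
    "Clone": ["PositionalFeature"],
    "Type A telomeric repeat": ["Telomere", "Repeat"],
}

# Hypothetical feature store: (feature id, feature type).
FEATURES = [
    ("f1", "Telomere"),
    ("f2", "Repeat"),
    ("f3", "Type A telomeric repeat"),
    ("f4", "Clone"),
]


def expand(term, parents=ONTOLOGY_PARENTS):
    """Return the term together with every type that descends from it."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, child_parents in parents.items():
            if child not in result and any(p in result for p in child_parents):
                result.add(child)
                changed = True
    return result


def query(requested_types):
    """Return ids of features whose type matches any requested type or a descendant."""
    wanted = set()
    for term in requested_types:
        wanted |= expand(term)
    return [fid for fid, ftype in FEATURES if ftype in wanted]
```

With this sketch, query(["Telomere"]) returns the plain telomere and the telomeric repeat, while query(["Type A telomeric repeat"]) returns only the telomeric repeat, so nothing extra crosses the wire.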

Am I mis-reading here?

> Well, we risk going around in semantic circles here.  You or I could put up a 
> relational database, make it available to the public, publish the schema, and 
> tell people to go ahead and query the server.  In fact, we do do that, and it 
> is very useful for querying single data sources.  But we want to have a 
> common data model, so that the same query templates will work with all 
> servers and will produce the same format results.  This requires that we 
> enumerate and name the queries, and enumerate and name the possible query 
> results.   If the services listed in the framework document can all be 
> expressed as DAML queries, then there is a direct mapping between the 
> services and the queries and we could indeed publish the service by 
> publishing the list of queries the server supports.
> 
> For my part, I'm happiest giving the services names.  This makes it easier 
> for us to think about the requirements for the project (e.g. efficient data 
> structures for providing the services), and how to divide up the work.


Sorry if this counts as circularity.  But yes, there are indeed publicly
accessible RDBMSs offering this kind of data.  However,
/they're all different/.  This in itself isn't actually a problem.
What makes these (almost) useless for anyone wanting to go out and
integrate data is that there's no machine-readable way of getting
any semantically useful information about the schema.

If people started putting up lots of RDBMS which either:

  - Use the same core schema (possibly with extension tables)

  - Or publish an annotated schema containing machine readable
    assertions of things like "the column clone.id encodes the
    public identifier of an entity of class Sequence" [obviously
    this requires that people first define terms like sequence...]

At this point, it would definitely be possible to build a working DAS.
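
As a rough illustration of the second option, here is a sketch of what machine-readable schema assertions might look like and how an integrating client could use them. The annotation format, the server names, and the function columns_for are all assumptions made up for this example, not an existing standard:

```python
# Hypothetical annotated schemas published by two differently designed servers.
# Each assertion says that a column encodes some role of some ontology class.
SCHEMA_ANNOTATIONS = {
    "server_a": [
        {"column": "clone.id", "role": "public_identifier", "class": "Sequence"},
        {"column": "clone.length", "role": "length", "class": "Sequence"},
    ],
    "server_b": [
        {"column": "seq_region.name", "role": "public_identifier", "class": "Sequence"},
    ],
}


def columns_for(role, cls, annotations=SCHEMA_ANNOTATIONS):
    """Map each server to the column that plays the given role for the given class."""
    found = {}
    for server, assertions in annotations.items():
        for assertion in assertions:
            if assertion["role"] == role and assertion["class"] == cls:
                found[server] = assertion["column"]
    return found
```

A client that knows only the shared terms ("Sequence", "public_identifier") can then locate the right column on each server without a human reading either schema.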

I'm not actually suggesting that SQL is the ideal query language for
DAS.  It's got many dialects, and it doesn't really handle polymorphic
data terribly brilliantly.  I'd rather see something more object-
oriented.  XQuery might work.  [Thinking about it, DAS 1.0 maps
almost perfectly to a simple RDBMS, albeit with rather strict
rules for which queries and joins are allowed.]


But the way the RDBMS view is helpful is that there's really
just the one service: SELECT.  All the different types of data
you are handling are just different tables.  Now, I know that
the `many services' model is potentially just a different view
on the same thing, but one-service, multiple-schemas is a nice
view, in that it makes clear that the query mechanism should
be the same for everything.
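
A small sketch of the one-service view, using Python's sqlite3 module with an in-memory database. The table and column names are invented for illustration, and the single select function stands in for whatever query mechanism DAS/2 might eventually adopt:

```python
import sqlite3


def select(conn, table, where=None, params=()):
    """The single service: fetch rows from any table the server publishes."""
    # Only allow table names that really exist in this database.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    sql = f"SELECT * FROM {table}"
    if where:
        # The WHERE clause is trusted here; values go through bound parameters.
        sql += f" WHERE {where}"
    return conn.execute(sql, params).fetchall()


# Hypothetical schema: different kinds of data are just different tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (id TEXT, ftype TEXT, seq_start INTEGER, seq_end INTEGER);
    CREATE TABLE sequence (id TEXT, length INTEGER);
    INSERT INTO feature VALUES ('f1', 'Telomere', 1, 100);
    INSERT INTO feature VALUES ('f2', 'Repeat', 50, 80);
    INSERT INTO sequence VALUES ('chr1', 1000);
""")

telomeres = select(conn, "feature", "ftype = ?", ("Telomere",))
sequences = select(conn, "sequence")
```

The same call fetches features and sequences alike; only the table name and filter change, which is the point of the one-service, multiple-schemas view.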


Just a few ramblings, anyway.

Thanks for all your work on the document -- I hope it continues
to generate more discussion,

    Thomas.