[DAS2] query_api and server layout

Tue Feb 7 16:50:56 UTC 2006

Continuing from yesterday's discussion...

There are several things in a DAS server

- there is the list of all sources and versions
- there is a list of all versions for a source
- there is the versioned source information

The versioned source only really provides a bit of
overall configuration information and links to three URLs:

   - the query interface for features
   - the query interface for types
   - the query interface for segments

It doesn't say anything about where the actual feature,
type and segment data is stored.  It doesn't even mean
that the query URLs are on the same machine as the versioned
source document.  Hence Andreas can have his registry server.

DAS defines what those queries do.  The segments query URL
interface can be a shared reference server.  It has a
rather simple interface:
   - get URLs and information for each segment
       - given a sequence URL return the sequence data
   - return the assembly data

The segment and sequence data does not need to be on the
same machine as the segments query URL.  It likely will
be but does not need to be.

DAS defines what the types interface does.  At present it
is also very simple.  By default it lists everything, or
you can ask it for an "ontology" or (proposed new query)
"exact_ontology", and it returns all DAS types which match
that request.

The actual DAS type data does not need to be on the same
server as the DAS query URL, though again it probably will
be.  The types query URL does not need to be on the same machine
as the segments query URL.

Similarly, the features query URL implements the DAS query
interface and returns a list of features.  The actual features
do not need to be on the same machine or directory location
as the feature query, or the types, or the segments.

Here are some possible reasons for the different locations:

Common case:
   - segments query URL and segments data on a reference server
   - versioned source provides its own types and features

New genome / internal project:
   - database implements all three query URLs

Registry server:
   - each versioned source entry points to the original machine's
       values for the segments, types and features query URLs

Multiple versions database, shared types:
   - segments points to the reference server
   - every versioned source's "types" query URL points to the same URL
   - each versioned source gets its own features query

Old-style CGI-based web server:
   - the "segments" query url points to the reference server
   - the individual features, types and sources are ".xml" files
       in the file system
   - the query URLs end with ".cgi" and start a CGI script

If we say that the URL for doing a types query is composed as:
   <the versioned source URL> + "/" (if missing) + "types"

then at the very least we preclude CGI-based servers.  No big
deal perhaps?  It also means slightly more duplication
when several versions of the database share the same DAS "types"
(and "segments").

I also think using a server-provided URL is easier than constructing
the URL in code.  Get the "query_url", perhaps resolved by the
xml:base.  That's it.  No need to add in the "/types".

Gregg worries about the network performance of having
   <FEATURE type="../../type/AB123">
    <LOC id="http://some.other.server" range="300:400"/>
    <REGION id="feature/QW41414" />
   </FEATURE>

because each location has the full URL to another server and
the type in this case refers to a types collection shared
by all of the versions of the source.

I've thought about that for a while.  It's a reasonable and
serious architectural concern.  I think the right response
is that that's an architecture decision we should leave up to
the data provider.  If Gregg wants more compact XML and finds that
on-the-fly compression slows things down too much, then his
DAS server can put the segments, types and features
not only on the same machine but in the same directory.

The following is valid (omitting some required parts)

<SOURCE>
   <VERSION id="/h_sapiens/v1/">
    <CAPABILITY type="features" query_id="/h_sapiens/v1/features" />
    <CAPABILITY type="types" query_id="/h_sapiens/v1/types" />
    <CAPABILITY type="segments" query_id="/h_sapiens/v1/segments" />
   </VERSION>
</SOURCE>
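
As a sketch of what a client might do with that document, the Python
below reads the CAPABILITY elements and resolves each query URL
against the URL the document came from.  The document URL, the
function name and the use of ElementTree are my assumptions for the
example; the attribute names are the ones used above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The example document from above (still omitting some required parts).
SOURCE_XML = """
<SOURCE>
   <VERSION id="/h_sapiens/v1/">
    <CAPABILITY type="features" query_id="/h_sapiens/v1/features" />
    <CAPABILITY type="types" query_id="/h_sapiens/v1/types" />
    <CAPABILITY type="segments" query_id="/h_sapiens/v1/segments" />
   </VERSION>
</SOURCE>
"""


def capability_urls(xml_text: str, document_url: str) -> dict:
    """Map each capability type to its query URL, resolved to absolute form."""
    root = ET.fromstring(xml_text)
    urls = {}
    for cap in root.iter("CAPABILITY"):
        urls[cap.get("type")] = urljoin(document_url, cap.get("query_id"))
    return urls


if __name__ == "__main__":
    # Hypothetical location of the sources document.
    for kind, url in capability_urls(SOURCE_XML, "http://das.example.org/sources").items():
        print(kind, url)
```

Nothing in the client depends on the three URLs sharing a host or a
directory; it only follows what the document says.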

The features request can return

GET /h_sapiens/v1/features
<FEATURES xmlns:das="...">
  <FEATURE id="F12345" type="Tabcde">
    <LOC id="C1" range="32:34"/>
    <REGION id="F789" />
  </FEATURE>
</FEATURES>

In this architecture, features start with an 'F', like
   /h_sapiens/v1/F12345
types start with a 'T', like
   /h_sapiens/v1/Tabcde
and segments start with a 'C', like
   /h_sapiens/v1/C1

This is about as compact as I think you can make it, yet it
still fits into the current DAS spec.  (You don't even need
the special character - it only makes it easier to see that
the names/URLs will never collide.)
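
For what it's worth, here is a small Python sketch of how a client
could expand those short ids back into full URLs, using ordinary
relative URL resolution against the URL of the features request.  The
host name and the function name are invented for the example, and I
left the elided xmlns:das attribute out of the sample document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical host; the path is the features query URL from the example.
FEATURES_URL = "http://das.example.org/h_sapiens/v1/features"

FEATURES_XML = """
<FEATURES>
  <FEATURE id="F12345" type="Tabcde">
    <LOC id="C1" range="32:34"/>
    <REGION id="F789" />
  </FEATURE>
</FEATURES>
"""


def expand_feature(feature: ET.Element, base_url: str) -> dict:
    """Resolve the compact ids in one FEATURE element against base_url."""
    return {
        "feature": urljoin(base_url, feature.get("id")),
        "type": urljoin(base_url, feature.get("type")),
        "locations": [urljoin(base_url, loc.get("id")) for loc in feature.findall("LOC")],
        "regions": [urljoin(base_url, reg.get("id")) for reg in feature.findall("REGION")],
    }


if __name__ == "__main__":
    root = ET.fromstring(FEATURES_XML)
    for feature in root.findall("FEATURE"):
        print(expand_feature(feature, FEATURES_URL))
```

Because the features query URL and the data live in the same
directory, every short id resolves to a sibling URL such as
/h_sapiens/v1/F12345 without any server-specific logic in the client.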

					Andrew
					dalke at dalkescientific.com