ids and URLs (was Re: [DAS2] Ontologies in DAS/2)

Wed Feb 8 16:36:11 UTC 2006

Dave Howorth wrote:
> I'm curious about the DAS use of id attributes, especially given an 
> expectation to use getElementById().
>
> DAS has attributes that are URLs - they include the '/' character.
>
> But getElementById() is an HTML or XHTML DOM method I believe.
>
> Both HTML 4 and XHTML require that id attributes be of type ID, I 
> think, and the ID type does not permit '/' characters (IDs are Names).
>
> I find it pretty confusing that DAS uses an attribute that is called 
> id that isn't an ID. And I'm curious to know if getElementById() works 
> with it? Sounds like a sloppy implementation of the DOM. Or did I miss 
> something?

We've been talking about this and related matters most of the
day.  It started with Thomas' question "How do I get all of the
exons in the database which are from Vega?"  (Vega being some
other database.)

All of the features which are exons from Vega have the same DAS
data type.  This means he wants to do a feature query with
type = <the DAS type id>

He needs to get the DAS type id.  He can get all of the exons
using an ontology search.  But he wants to search for the string
"exon".  Given the discussion yesterday, will the type query
support "ontology='exon'" or must he use some other service to
convert "exon" to "SO:exon" or to "http://some/server.url"?

Suppose for now it is "SO:exon".  He does

     http://das.server/../types?ontology=SO:exon

That gets all of the exon types, but not the ones from Vega.
The Vega types have a source="Vega".  DAS type queries do
not support searching on that field.

PROPOSAL:  Add a "source=" (case-insensitive substring search)
field to the types query.  (I don't think there is any contention
here so I'll add it.)

     http://das.server/../types?ontology=SO:exon;source=Vega

That comes back with a single DAS type.
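As a rough sketch of what the proposed "source=" search could mean on the
server side, here is a small Python filter.  The record layout and the
field names are assumptions for illustration only; they are not part of
the DAS/2 spec.

```python
# Hypothetical in-memory type records; the field names are assumptions.
TYPES = [
    {"id": "T12345", "ontology": "SO:exon", "source": "Vega"},
    {"id": "T12346", "ontology": "SO:exon", "source": "Ensembl"},
    {"id": "T12347", "ontology": "SO:intron", "source": "vega"},
]


def filter_types(types, ontology=None, source=None):
    """Return the types matching every filter that was given."""
    matches = []
    for t in types:
        # The ontology term is compared exactly.
        if ontology is not None and t["ontology"] != ontology:
            continue
        # "source=" is a case-insensitive substring search.
        if source is not None and source.lower() not in t["source"].lower():
            continue
        matches.append(t)
    return matches


exon_vega = filter_types(TYPES, ontology="SO:exon", source="Vega")
```

With the sample records above, only the one exon type from Vega is left
in exon_vega.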

He now wants to search for all features with that type.  What
does he use for the query?  Is it (assuming proper escaping)

    http://das.server/../features?type=http://das.server/../type/T12345

?  That's rather excessive, especially if there are many
DAS types derived from the given ontology term.

All around people want to use "T12345" for that, and not the full
URL.  Are there people who do want to use the full URL?
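To make "proper escaping" concrete, here is a small Python comparison of
the two query forms.  It uses the placeholder URLs from this mail; only
the standard library's urllib.parse.quote is involved.

```python
from urllib.parse import quote

# Placeholder URLs copied from the text above.
FEATURES_QUERY = "http://das.server/../features"
type_url = "http://das.server/../type/T12345"

# safe="" makes quote() escape "/" and ":" as well.
long_query = FEATURES_QUERY + "?type=" + quote(type_url, safe="")

# The short form needs no escaping at all.
short_query = FEATURES_QUERY + "?type=" + "T12345"
```

The escaped form ends in "type=http%3A%2F%2Fdas.server%2F..%2Ftype%2FT12345",
which is the sort of thing nobody wants to type or read.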

The current system comes from saying the URL is the identifier
for a DAS object.

If, as Dave points out, we have an "id" which is a simple string
(of the format /[A-Za-z0-9_]+/ or so) then there's no problem.
We can use that for the query, as

    http://das.server/../features?type=T12345
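A minimal Python check for ids of that form might look like the
following.  The function name is made up for this example, and the "or
so" above means the exact pattern is not settled.

```python
import re

# The pattern from the text; fullmatch() applies it to the whole string.
ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_id(candidate):
    """Return True if candidate is a non-empty run of letters, digits, or '_'."""
    return ID_PATTERN.fullmatch(candidate) is not None
```

Note that this pattern accepts ids which start with a digit.  An XML
attribute of type ID must be a Name, and a Name cannot start with a
digit, so the pattern would need tightening if these are to be true IDs.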

PROPOSAL: do not use a URL for the identifier for objects

That fixes a few problems:
   - xml:base is no longer an issue; these are ids and not URLs
   - the names are short and sweet

It introduces a few problems.

Problem 1: a feature has a type.  How can the client get from the
type id to the type information if there is no URL to resolve?

   Solution 1: add an 'id=' term to the types query URL, eg
      http://das.server/../types?id=T12345
   (or possibly call it 'type=')

   Solution 2: append "/" + type id to the types query URL, eg
     http://das.server/../types/T12345

   Solution 3: have both an 'id' and an 'href' attribute

   Solution 4: the client downloads all the types and compares
    the id fields.

QUESTION:
   At Hinxton nearly all the DAS servers have only one or two types.
Ensembl has 45 types and Allen's has about 50.  Is it reasonable
to have clients just go ahead and download everything and not
worry about a query language?  Is Chado any different?

Problem 2: a feature can refer to its parent and part features.
It can refer to regions on other features.  How does a client get
information about the feature given the feature id?

   Solution 1: add an 'id=' term to the features query URL
   Solution 2: append "/" + feature id to the feature query URL
   Solution 3: have both an 'id' and an 'href' attribute

We discussed this a lot and decided on

PROPOSAL: add an 'id=' query to the types and features query.

We decided against solution 2 because of me - I don't like
working with URLs that way.  Thomas pointed out that an 'id='
query is useful, eg, if a feature has three parts then a client
can request

    http://das.server/../features?id=part1,part2,part3
(NOTE: we're also thinking of proposing this syntax for an 'OR'
query over the same term
    http://das.server/../features?id=part1;id=part2;id=part3
)
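Both spellings are easy to build and to parse.  The Python sketch below
assumes ";" separates query fields and "," separates values, as in the
examples above; the helper names are invented for illustration.

```python
FEATURES_QUERY = "http://das.server/../features"  # placeholder from the text


def id_query_comma(base, ids):
    """Build the comma-separated form: ?id=a,b,c"""
    return base + "?id=" + ",".join(ids)


def id_query_repeated(base, ids):
    """Build the repeated-term form: ?id=a;id=b;id=c"""
    return base + "?" + ";".join("id=" + i for i in ids)


def parse_ids(query):
    """Collect ids from a query string written in either form."""
    ids = []
    for field in query.split(";"):
        key, _, value = field.partition("=")
        if key == "id":
            # A single id= term may itself carry several comma-separated ids.
            ids.extend(v for v in value.split(",") if v)
    return ids


parts = ["part1", "part2", "part3"]
comma_url = id_query_comma(FEATURES_QUERY, parts)
repeated_url = id_query_repeated(FEATURES_QUERY, parts)
```

A server that parses this way gives the same list of ids for either
form, so supporting both need not mean two code paths.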

I pointed out that having both means there are two ways in the
server to look up by id - extra machinery.

QUESTION: Who will want to refer to features and types by URL?

Possibilities:
   - hypothetical model where the queries return a list of URLs and
the client (through HTTP pipelining) asks only for the ones it
doesn't have already; saving bandwidth.  THIS IS NOT A USE CASE!

   - request a feature in a specific format (but that can be done
       through the query URL)

   - RDF people who want individually named items (not a use case)

We couldn't come up with a case where someone would want to
refer to features and types as an individually named URL!

For segments there is a use case - you can ask for sequence by
range, and that's through the segment URLs.  However, that could
be done with the segment query URL so it's not a strong use case.
In any case, it hasn't been a problem so I'll put that off for now.

That being the case, there's no need to consider "Solution 2".
Why have URLs if no one wants to use them?

What did come up during the discussion here was that we had
planned to use URLs for writeback.  That model seems rather
nice.  "DELETE" and "PUT" to the correct URLs, rather than
going through a "POST to delete.cgi?type_id=", etc.
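For what it's worth, here is how a client might prepare such requests
with Python's standard library.  Nothing is sent; the editing URL layout
and the request body are placeholders, since the writeback format has
not been decided.

```python
from urllib.request import Request

# Hypothetical per-feature URL; the "edit/" layout is an assumption.
feature_url = "http://das.server/../edit/feature/F1"

# Placeholder body; the real writeback document format is not defined here.
new_feature_xml = b"<FEATURE id='F1' />"

# DELETE removes the feature at that URL.
delete_request = Request(feature_url, method="DELETE")

# PUT replaces the feature at that URL with the given document.
put_request = Request(
    feature_url,
    data=new_feature_xml,
    method="PUT",
    headers={"Content-Type": "application/xml"},
)
```

Passing either object to urllib.request.urlopen() would perform the
call.  The point is only that the verb and the URL carry the meaning,
with no "delete.cgi" style endpoint needed.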

The model for writeback was something like "ask server to make
a copy, with region A:C available for editing.  User works
with region.  User commits region back to server."

In that case, the request for region might as easily make a
copy of the source, available through a special URL visible
only to that one user.  In this copy it can expose "url="
attributes for editing, perhaps also with a "writeable=" field
because some features will not be editable for that user.

I complained yesterday about "writeable" but that was because
for the general purpose server the concept of "writeable" was
user-specific and not appropriate.  In this writeback model
it's just fine.

Another thing came up during discussion of this.  Roy yesterday
proposed the idea of a simple server which only supports getting
"everything".  It doesn't support the DAS query specification.
That is, it only supports

   http://das.server/../types
   http://das.server/../features

and fetching those returns everything.  This is useful for small
data sets because those could be simple files, like

   http://das.server/../types.xml
   http://das.server/../features.xml

Still, for that case there would need to be "feature/F1", "type/T2",
etc.  In essence, a duplicate of every record.

Last December during discussion people said there was no use
case for this sort of flat-file oriented server.  This was not
a design goal.

Thomas mentioned that there is a use case: uploading DAS
tracks to a server.  People complain now that it's hard to
do that.  With this url-less model people can upload a small
number of documents (or a .zip file of a directory) with
the versioned source, types, and features data.

<!-- this is "sources.xml" -->
<VERSION>
   <COORDINATES ... />
   <CAPABILITY type="types" query_url="types.xml">
     <FORMAT name="das2xml" />
     <SUPPORTS name="all" />
   </CAPABILITY>
   <CAPABILITY type="features" query_url="features.xml">
     <FORMAT name="das2xml" />
     <SUPPORTS name="all" />
   </CAPABILITY>
</VERSION>

<!-- this is features.xml -->
<FEATURES>
</FEATURES>

<!-- this is types.xml  -->
<TYPES>
</TYPES>

There is no need to have an "exploded" copy of all of the
records in parallel to the types and features xml files.
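With a flat types.xml, a client can do the lookup from "Solution 4"
above by itself: download the file once and index it by id.  A minimal
Python sketch follows.  The TYPE element and its attributes are
assumptions for illustration, since the example files above are shown
empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of types.xml; element and attribute names are assumed.
TYPES_XML = """
<TYPES>
  <TYPE id="T12345" ontology="SO:exon" source="Vega" />
  <TYPE id="T12346" ontology="SO:exon" source="Ensembl" />
</TYPES>
"""


def index_by_id(xml_text, tag):
    """Parse a document and map each element's id attribute to the element."""
    root = ET.fromstring(xml_text.strip())
    return {elem.get("id"): elem for elem in root.iter(tag)}


types_by_id = index_by_id(TYPES_XML, "TYPE")
vega_type = types_by_id["T12345"]
```

The same function would work for features.xml with tag="FEATURE", which
is why no per-record "type/T2" or "feature/F1" documents are needed.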

Big Advantage:

Stylesheets are much easier to write.  Refer to fields by
short id instead of long URL.

Conclusion:
   Proposal 1: "id"s are of the form /[A-Za-z0-9_]+/
   Proposal 2: FEATURE and TYPE elements have an optional "url"
             (or "href") attribute
   Proposal 3: the feature and type queries support an 'id=' search
   Proposal 4: the type query supports a "source=" search

Churn factor:
   Allen's server doesn't need the 'type/' and 'feature/' fields
   Gregg and others don't need to worry about xml:base any more.
   Type and feature lookups need to track the query URL as well
     as the type and feature id
   We need a new 'id=' search capability

These don't seem big in a programming sense, more in a conceptual one.

					Andrew
					dalke at dalkescientific.com