[DAS] Re: das/2 proposal status

Tue Sep 21 21:25:19 EDT 2004

Hi Lincoln and others,

Lincoln:
> Here's the latest version of the "request" portion of the DAS/2 spec,
> recently converted into HTML.  I haven't proofread the HTML yet; any
> help you can render would be appreciated.

I've gone over the proposal.  Here are some of the things I've
noticed.

When going over the spec I'm trying to keep a few things in mind.

   - everything that can be a relative URL may instead be an absolute
      URL pointing to another machine

   - ReST requires that the architecture not depend on the actual
      URL hierarchy.  Eg, links aren't made by knowing to add "sequence/"
      to get the sequence data but are instead found in the result
      of a previous request

   - Reversing that, the program shouldn't figure out what a URL
      does by analyzing it.

That said, the hierarchical structure is fine.  It's meant both
for humans and as a way to minimize the size of links.  And except
in a few small (and fixable) places the spec doesn't restrict
the server to use a hierarchy.

 > The HTML fragment notation "#" is never used.

except in references to external URLs, like the GO link to
   http://song.sourceforge.net/ontologies/sofa#tRNA

 > In addition to the standard HTTP response headers, DAS servers
 > return the following HTTP headers:
 >	• 	 X-DAS-Version: DAS/2.0
 >	• 	 X-DAS-Status: XXX status code

How much of that can be moved into the standard HTTP
error codes, possibly with a parseable error message?

There are a few advantages to the X-DAS headers.

  * You can tell if the server uses the DAS/1 or DAS/2 API.
An alternate solution is to put a version string in the
response from the server.  Perhaps <META version="DAS/2.0" />?

  * A client can dispatch on the status codes without
having to parse the payload.  But 1) clients need to
know how to handle other HTTP error codes (like 404
'Not Found', or 403 'Forbidden').  2) there's overlap
between some of the codes and HTTP codes -- shouldn't
HTTP error 400 'Bad Request' be sent when DAS-Status
of 4** is sent?  3) HTTP codes are more complete;
405 'Method Not Allowed' for when someone does a
PUT on a server that doesn't allow PUT, or 423 'Locked'
(from RFC 2518).

  * The Status codes are more specific than the HTTP
codes.  I think the answer there is to include a
more detailed error message in the HTTP payload.

My final reason for preferring the standard HTTP codes is that
it allows someone to implement major portions of the DAS/2
interface with flat-files served by a stock Apache.  That
should help a lot with
making a test suite -- I could even use file URLs
and do without any web server.
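
To make the idea concrete, here is a rough Python sketch of a
client that dispatches on the plain HTTP status and takes the
detail from the payload.  The function name and the convention
that the first line of the body carries the message are my own
assumptions, not anything in the spec.

```python
from http import HTTPStatus


def summarize(status, body=""):
    """Classify an HTTP response without any X-DAS-Status header.

    Assumption: the first line of the payload is a human-readable
    error detail.  The spec does not define this.
    """
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        phrase = "Unknown Status"

    if 200 <= status < 300:
        kind = "ok"
    elif 400 <= status < 500:
        kind = "client-error"
    elif 500 <= status < 600:
        kind = "server-error"
    else:
        kind = "other"

    lines = body.strip().splitlines()
    detail = lines[0] if lines else ""
    return kind, phrase, detail
```

A client that only cares about the class of failure can look at
the first element and ignore the rest.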

 > http://server/das-genome
 >
 > List of data sources maintained by server "server."
 > The URL as a whole acts as a unique identifier
 > for this DAS/2 server.

This is limiting.  'server' usually means hostname, or
hostname + port.  I see no reason to prohibit the
main entry point from being

http://www.example.com/~dalke/my-servers/SantaFe

or limiting the connection to only http.  Why not
also allow, say, https?

https://www.example.com/secure/das-genome

One worry I see, btw, is the difference between

   http://server/das-genome
and
   http://server/das-genome/

I haven't been careful in checking all the uses of "/"
vs. no "/".  From what I've seen it's fine, in large
part because of the xml:base use.
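
The reason the trailing "/" matters is that standard relative URL
resolution drops the last path segment of the base when it does
not end in "/".  A small Python illustration, using the
placeholder host "server" from the spec and a relative id of my
own choosing:

```python
from urllib.parse import urljoin

# Without the trailing slash, "das-genome" is treated as a file name
# and is replaced by the relative reference.
without_slash = urljoin("http://server/das-genome", "volvox/1")

# With the trailing slash, "das-genome/" is treated as a directory.
with_slash = urljoin("http://server/das-genome/", "volvox/1")

print(without_slash)
print(with_slash)
```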

 > Two formats are supported: a verbose XML format of
 > type application/x-das-source, and a compact ...

Later on in the editing I'll point out more examples
like this where the language is descriptive instead of
prescriptive.  Must servers implement both formats?
Or can they support neither and use a 3rd mechanism?

 >     xmlns="http://www.biodas.org/ns/das-genome/2.00"

Probably should be "2.0" to match the server version

 > <SOURCES>
 > ...
 > <SOURCE>
 > ...
 > <VERSION>

In general the spec doesn't say which fields are required
and which are optional.

Will we be using DTDs or some other schema for this?
Either way, my experience with the DAS/1 DTDs is that they
weren't that useful.  I built my parser on them and had to
correct various typos in them.  My parser was validating, and
it ended up failing when used against servers with extensions.

 > The version column is any sequence of characters excluded
 > tab and newline

In general the word 'character' needs to be made more
specific.  I think you mean "printable ASCII character"
as compared to "Unicode character."

The restrictions on the source URL and source version
fields need to be propagated back to the XML names.
That is, it should be illegal to have

   <SOURCE id="vol%09vox" ...>
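
A check along these lines would catch that case.  This is only a
sketch; the function name is mine, and I am assuming the rule is
"no tab or newline after percent-decoding".

```python
from urllib.parse import unquote


def is_legal_id(raw_id):
    """Reject ids that contain a tab or newline once percent-decoded.

    Assumption: the flat-file restriction applies to the decoded form.
    """
    decoded = unquote(raw_id)
    return "\t" not in decoded and "\n" not in decoded
```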

Also, in the XML the 'id' and 'version' fields are both
resolvable URLs relative to the xml:base.  You have

<SOURCE id="volvox" ...>
   <VERSION id="volvox/1" ...>

In the flatfile example you have

volvox	1	V. volvulus ...

That should likely be

volvox	volvox/1	V. volvulus ...

Because those can be arbitrary URLs the following should
be allowed

<SOURCE id="http://cshl.edu/das2/volvox" ...>
   <VERSION id="http://dalkescientific.com/das2/volvox/1" ...>

which would be written

http://cshl.edu/das2/volvox	http://dalkescientific.com/das2/volvox/1	V. volvulus ...

This exact case isn't likely but it should be allowed.

 >  By adding the version to the end of the path, the URL
 > becomes an identifier for the versioned data source. Retrieving

You often use the language of string concatenation to
describe how to fully expand a URI in the context of a base url.
Since it may be an absolute URL, I ask there be some other
language instead.

However, I don't know what that word would be.
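
To show why the distinction matters, here is a short Python
comparison of concatenation against proper URL resolution.  The
base URL is taken from the spec's examples; the helper name is
mine.

```python
from urllib.parse import urljoin

BASE = "http://www.wormbase.org/das-genome/"


def resolve(base, ref):
    """Resolve 'ref' against 'base' instead of concatenating."""
    return urljoin(base, ref)


relative = resolve(BASE, "volvox/1")
absolute = resolve(BASE, "http://dalkescientific.com/das2/volvox/1")

# Concatenation gives a nonsense URL when the id is already absolute.
concatenated = BASE + "http://dalkescientific.com/das2/volvox/1"
```

Resolution leaves the absolute id alone, while concatenation
glues it onto the end of the base.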

 > Fetching Information about Data Sources: The Sources Request

(backing up a bit)
 > As a special case, a version of 0 (numeric zero) selects the
 > current (most recent) version of sourceid. For this reason a
 > version of 0 is reserved.

How is a client supposed to know to use version 0?  It looks
like that's done by string concatenation to the URL, but as I
mentioned I don't like that approach.

I can think of two solutions.  1) add a new element like

<LATEST id="volvox/0" />

to the <SOURCES>.  2) add an attribute to the VERSION
element, like

   <VERSION id="volvox/2" latest="Y" description="...." />

Is the concept of "latest version" something that needs
to be named?
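
For option 2, a client could find the latest version without
building any URL by hand.  A minimal Python sketch follows; the
'latest' attribute is the hypothetical one proposed above, and
the descriptions in the sample document are made up.

```python
import xml.etree.ElementTree as ET

# Sample document with the proposed (hypothetical) 'latest' attribute.
DOC = """<SOURCE id="volvox">
  <VERSION id="volvox/1" description="first release" />
  <VERSION id="volvox/2" latest="Y" description="second release" />
</SOURCE>"""


def latest_version_id(xml_text):
    """Return the id of the VERSION flagged latest="Y", or None."""
    root = ET.fromstring(xml_text)
    for version in root.iter("VERSION"):
        if version.get("latest") == "Y":
            return version.get("id")
    return None
```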

If the /0 URL is resolved, what does it do?  Is it a
redirect to the most recent version?

Must/should the list of versions be in some order?  Like
from oldest to newest?  Should clients preserve the order
when showing it to users?

 > REQUEST:
 > http://www.wormbase.org/das-genome/volvox/2
 >
 >  RESPONSE:
 > Content-type: application/x-das-source-details
 >
 > <?xml version="1.0" standalone="no"?>
 > <!DOCTYPE DAS2DSNDETAILS SYSTEM "http://www.biodas.org/dtd/das2dsndetails.dtd">
 > <SOURCE
 >       xmlns="http://www.biodas.org/ns/das-genome/2.00"
 >       xmlns:xlink="http://www.w3.org/1999/xlink"
 >       xml:base="http://dev.wormbase.org/das"
 >       id="volvox"
 >       description="Volvox Example Database">
 >    <VERSION id="volvox/1"

The xml:base should be wormbase.org/das-genome.
The VERSION id should be "volvox/2"

 > <NAMESPACES>
 >   <NAMESPACE id="volvox/1/type">Feature types
 >      <FORMAT id="das2xml" type="application/x-das-types" />
 >      <FORMAT id="compact" type="application/x-das-compact-types" />
 >   </NAMESPACE>
 >   <NAMESPACE id="volvox/1/feature">A genomic feature
 >      <FORMAT id="das2xml" type="application/x-das-feature" />
 >      <FORMAT id="gff3"    type="application/gff3" />
 >      <FORMAT id="gtf"     type="application/gtf" />
 >      <FORMAT id="bed"     type="application/bed" />
 >   </NAMESPACE>

How does a client know what to do with each of these
namespaces?  Should it expect to get an
application/x-das-types from volvox/1/sequence?
Why or why not?

As written the only way to figure it out is to look at
the end of the URL, which I don't like.  I would rather
have the namespace content type stated as an attribute:

    <NAMESPACE id="volvox/1/feature" nstype="feature">...

  ('nstype' is an ugly name but 'type' is already
used for feature type and for content type).
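
With such an attribute the client can look up a namespace by
what it is, not by how its URL happens to end.  A small Python
sketch, assuming the hypothetical 'nstype' attribute and a
cut-down sample document of my own:

```python
import xml.etree.ElementTree as ET

# 'nstype' is the hypothetical attribute suggested above.
DOC = """<NAMESPACES>
  <NAMESPACE id="volvox/1/type" nstype="type" />
  <NAMESPACE id="volvox/1/feature" nstype="feature" />
</NAMESPACES>"""


def namespace_ids(xml_text):
    """Map each declared nstype to the id (URL) of its namespace."""
    root = ET.fromstring(xml_text)
    return {ns.get("nstype"): ns.get("id") for ns in root.iter("NAMESPACE")}
```

The server is then free to use any URL it likes for the feature
namespace, including one on another machine.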

What is the text of the <NAMESPACE> element used for?
That's the "A genomic feature" in the following

 >   <NAMESPACE id="volvox/1/feature">A genomic feature
 >      <FORMAT id="das2xml" type="application/x-das-feature" />

Is the following also allowed?

    <NAMESPACE id="volvox/1/feature">A genomic
       <FORMAT id="das2xml" type="application/x-das-feature" /> feature

It would be better, I think, to have that inside an
element or attribute, as

    <NAMESPACE id="volvox/1/feature" description="A genomic feature">
       <FORMAT id="das2xml" type="application/x-das-feature" />
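
The mixed-content form is awkward for parsers.  As an
illustration, here is how Python's ElementTree sees the two
variants; the text ends up split between the parent's text and
the child's tail, while the attribute form is a single lookup.

```python
import xml.etree.ElementTree as ET

MIXED = (
    '<NAMESPACE id="volvox/1/feature">A genomic '
    '<FORMAT id="das2xml" type="application/x-das-feature" />'
    " feature</NAMESPACE>"
)
WITH_ATTR = (
    '<NAMESPACE id="volvox/1/feature" description="A genomic feature">'
    '<FORMAT id="das2xml" type="application/x-das-feature" />'
    "</NAMESPACE>"
)

mixed = ET.fromstring(MIXED)
before = mixed.text                 # text before the FORMAT child
after = mixed.find("FORMAT").tail   # text after the FORMAT child

description = ET.fromstring(WITH_ATTR).get("description")
```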

 >  Dates should follow  the HTTP date specification.

RFC 2068 (HTTP/1.1) allows three different formats

    HTTP applications have historically allowed three different formats
    for the representation of date/time stamps:

           Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
           Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
           Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

    The first format is preferred as an Internet standard and represents
    a fixed-length subset of that defined by RFC 1123  (an update to RFC
    822).  The second format is in common use, but is based on the
    obsolete RFC 850 [12] date format and lacks a four-digit year.
    HTTP/1.1 clients and servers that parse the date value MUST accept
    all three formats (for compatibility with HTTP/1.0), though they MUST
    only generate the RFC 1123 format for representing HTTP-date values
    in header fields.

I would prefer the DAS spec be more specific about which
of those is allowed.  I think it's okay to say "RFC 1123 with
4 digit years".  We can pin this down later.
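
For reference, accepting all three formats while generating only
one is not much code.  A Python sketch, with function names of my
own and assuming the default C/English locale for day and month
names:

```python
from datetime import datetime

RFC_1123 = "%a, %d %b %Y %H:%M:%S GMT"
ACCEPTED_FORMATS = [
    RFC_1123,                      # Sun, 06 Nov 1994 08:49:37 GMT
    "%A, %d-%b-%y %H:%M:%S GMT",   # Sunday, 06-Nov-94 08:49:37 GMT
    "%a %b %d %H:%M:%S %Y",        # Sun Nov  6 08:49:37 1994
]


def parse_http_date(text):
    """Parse any of the three historical HTTP date formats (as GMT)."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("not an HTTP date: %r" % (text,))


def format_http_date(when):
    """Generate only the RFC 1123 form, with a four-digit year."""
    return when.strftime(RFC_1123)
```

If the DAS spec allows only the RFC 1123 form, the list shrinks
to its first entry.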

 > <METHOD>
 >   The id attribute within each <METHOD> tag corresponds to  an
 >   HTTP method, and is one of "GET," "PUT," "DELETE" or "POST."
 >   Clients can use this information to determine whether a  data
 >   source is updateable.

I don't know how needed this is.  Eg, a data source might be
editable but not by the person who fetched this data.  I suspect
this can't be fully figured out until the write interface is done.

 > <FORMAT>
 >   A data format recognized by this server. The id attribute is
 >   the short name of the format for use in the GET URL, and the
 >   type attribute is the returned document's MIME type.

That should probably be 'name' instead of 'id', for consistency's
sake, since 'id' otherwise seems to always be used for resolvable URIs.

 > <TYPES
 >      xmlns="http://www.biodas.org/ns/das-genome/2.00"
 >      xmlns:xlink="http://www.w3.org/1999/xlink"
 >      xml:base="http://www.wormbase.org/das-genome/volvox/1/type/">
 >  <TYPE id="tRNAscan"
 >        ontology="http://song.sourceforge.net/ontologies/sofa#tRNA"
 >          source="tRNAscan-SE-1.11"
 >        xml:base="tRNAscan/">
 >     <ATT id="glyph/">

There are two xml:base attributes.  How is the ATT id resolved?
Is it resolved upwards through all the enclosing URLs?  That is,
   url = "glyph/"
   for base in ["tRNAscan/",
                "http://www.wormbase.org/das-genome/volvox/1/type/",
                 ... URL used to fetch the document ... ]:
      url = urljoin(base, url)

    .. use 'url' to reference the glyph data ..
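
For what it's worth, the XML Base rule runs in the other
direction: each xml:base is resolved against the base of its
enclosing element, outermost first, and the id is then resolved
against the innermost result.  Here is a runnable Python sketch
of that reading.  The helper name is mine, and the document URL
is an assumption since the spec excerpt does not give the
request.

```python
from urllib.parse import urljoin


def resolve_id(document_url, xml_bases, ref):
    """Resolve 'ref' through nested xml:base values, outermost first."""
    base = document_url
    for xml_base in xml_bases:
        base = urljoin(base, xml_base)
    return urljoin(base, ref)


glyph_url = resolve_id(
    # Assumed URL used to fetch the document.
    "http://www.wormbase.org/das-genome/volvox/1/type",
    # xml:base values from <TYPES> and then <TYPE>.
    ["http://www.wormbase.org/das-genome/volvox/1/type/", "tRNAscan/"],
    "glyph/",
)
print(glyph_url)
```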

There's a typo -- replace &Lt; with <
 >  </TYPE>
 >  &Lt;TYPE id="curated_gene"

 >     <ATT id="glyph"     value="box" />
 >     <ATT id="bgcolor"   value="white" />
 >     <ATT id="fgcolor"   value="black" />
 >     <ATT id="key"       value="tRNAs" />
 >     <ATT id="citation"  value="tRNA predictions ..." />
 >     <ATT id="fontcolor" value="slateblue" />
 >     <ATT id="height"    value="3" />

At some point these need to be defined more formally.
How does a client app know what 'glyph' means, or what
"white" means?

Do these need to be individually named?  As written
these are resolvable as URLs.  It seems rather too
fine grained to me, and I like named items!

The problem comes down to how the software is
expected to know how to interpret a name.  There's
nothing in the protocol to say that "glyph" determines
how to draw a given feature type.

It can be resolved in at least two ways.  One is
to add a datatype field to each of the attrs, where
the datatype comes from a controlled vocabulary.

The other is to drop the id scheme and just leave this
as a key/value table.  That means that individual
attributes of the feature type will not be fetchable.

OTOH, this can be left as is.  I don't think it's
that big a problem.  I can appease myself by saying
that there's a <METADATA> element which describes
the datatype of each id, and when not given it
defaults to http://www.biodas.org/specs/2.0/metadata
which defines things properly.  ;)

 > Fetching Information About Sequences: The Sequence Request
 >
 > Appending "dna" to the end of a versioned data source URL
 > addresses the raw sequence data. Fetching this URL
 > returns a FASTA file containing all the sequences known
 > to the data source:
 >
 > REQUEST:
 > http://www.wormbase.org/das-genome/volvox/1/sequence

("append" is another string concatenation operation ...)

The text says to append "dna" but the example uses
"sequence".

The Content-Type is "application/fasta".  Shouldn't
that be "application/x-fasta"?

Is there any way to get a list of sequence ids?  I had
assumed .../1/sequence would return a document listing
all of them, but it appears to return a FASTA file instead.

 > Ranges have the following format:
 >   seqid/min:max:strand

Are the following allowed?

   Chr1/::-1  -- reverse complement of all of Chr1

   Chr1/1000: -- Chr1 from 1000 to the end
      (I would rather use this than Chr1/1000 because to
       me that looks like asking for the base at position 1000)

   Chr1/1000::-1 -- reverse complement of Chr1 from 1000
       to the end

   Chr1/:: -- The entire sequence named Chr1
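
One permissive reading of the format, where every field after
the seqid is optional, can be sketched in Python as below.  This
is my guess at the intent, not what the spec says.  The names
are mine, and I assume the seqid itself contains no "/".

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    seqid: str
    start: Optional[int]   # None means "from the beginning"
    end: Optional[int]     # None means "to the end"
    strand: Optional[str]  # None means "not given"; meaning left open


def parse_range(text):
    """Parse seqid/min:max:strand, treating missing fields as None."""
    seqid, _, rest = text.partition("/")
    fields = rest.split(":") if rest else []
    if len(fields) > 3:
        raise ValueError("too many ':' fields in %r" % (text,))
    fields += [""] * (3 - len(fields))
    start, end, strand = fields
    return Range(
        seqid,
        int(start) if start else None,
        int(end) if end else None,
        strand or None,
    )
```

Under this reading a bare seqid and a seqid followed only by
empty fields parse identically, so the spec would need to say
whether a missing strand means "both strands" or "unknown".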

Is there a difference between
   Chr1
   Chr1/
   Chr1/:
   Chr1/::
   Chr1/::0

More specifically, which mean "on both strands" and which mean
"unknown strand"?

That's what I've managed to review in the last 3.5 hours.  I
still have another 9 pages to go, leaving off with the
"Fetching Information About Features" section.

But I should take a nap now so I can be coherent at 3am
for the conference call  :)

					Andrew
					dalke at dalkescientific.com