[DAS] DAS1.6: coordinate systems

Andy Jenkinson andy.jenkinson at ebi.ac.uk
Thu Aug 12 09:30:08 UTC 2010


Hi Thomas,

Uh-oh, URIs... :)

For coordinate systems, I think the definitions of the component pieces are fairly well described. It is a pity that the species name is not given its own parameter though. The sources documentation then says: "The uri (required) attribute is a globally unique identifier for the coordinate system. It should be a fully resolvable URL providing more information about the coordinate system." This could be misleading: although the URIs _are_ resolvable, the content they return is HTML intended for humans and is not particularly machine-friendly.

I am not willing to change the syntax of the coordinate system URIs out in the wild, but if you need the content returned to be machine-readable we could replace the HTML content with an XML+XSLT combination. That is, "http://www.dasregistry.org/dasregistry/coordsys/CS_DS6" would look more like one of the entries in "http://www.dasregistry.org/das/coordinatesystem" to a machine, and the same as it currently does to a human. From a practical perspective though, if a client parses the XML elements from the registry's /das/coordinatesystem output, it can identify all the coordinate systems by both URI and text description. Changing the output wouldn't materially change what a client needs to do given either a URI or a comma-separated string: it is always going to need to run an HTTP GET and do some parsing of coordinatesystem XML. But it is certainly true that having the URI resolve to the XML is a more elegant system and simpler to explain, and in any case the spec makes no mention of the fact that a client can even obtain the XML for all the coordinate systems together.
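To make the "identify by both URI and text description" point concrete, here is a minimal Python sketch of what a client might do with the registry output. It is only an illustration under stated assumptions: the function name index_coordinate_systems is made up, the wrapper element name in the sample is a guess (the code does not depend on it), and the sample document reuses the single COORDINATES element quoted in Thomas's mail below instead of a real HTTP GET.

```python
import xml.etree.ElementTree as ET

# Sample input: the COORDINATES element quoted later in this thread.
# The wrapper element name is an assumption; the code never relies on it.
SAMPLE_XML = """<DASCOORDINATESYSTEM>
  <COORDINATES uri="http://www.dasregistry.org/dasregistry/coordsys/CS_DS40"
               taxid="9606" source="Chromosome" authority="NCBI"
               test_range="1:1,1000"
               version="36">NCBI_36,Chromosome,Homo sapiens</COORDINATES>
</DASCOORDINATESYSTEM>"""


def index_coordinate_systems(xml_text):
    """Return two dicts over the same records: one keyed by URI,
    one keyed by the comma-separated text description."""
    root = ET.fromstring(xml_text)
    by_uri = {}
    by_description = {}
    for element in root.iter("COORDINATES"):
        record = dict(element.attrib)
        record["description"] = (element.text or "").strip()
        uri = record.get("uri")
        if uri is None:
            # The spec says uri is required; skip anything malformed.
            continue
        by_uri[uri] = record
        by_description[record["description"]] = record
    return by_uri, by_description


if __name__ == "__main__":
    by_uri, by_description = index_coordinate_systems(SAMPLE_XML)
    # Either form of reference leads to the same record.
    print(by_description["NCBI_36,Chromosome,Homo sapiens"]["uri"])
```

A real client would fetch the XML from the registry's /das/coordinatesystem URL first; the lookup step stays the same whether it starts from a URI (as in an alignment query) or from a comma-separated string (as in an alignment response).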

Throughout writing the 1.6 spec, URIs have always been a big problem to describe, mainly because there are lots of complications for DAS (source vs version, server vs registry namespace). URIs simply weren't given a lot of thought and explanation from the start, and it's too late to change them. In 1.6 things are a little better in that source URIs have been formalised and are more useful, without breaking the assumptions clients currently make. But I have changed the wording describing URIs a few times. I did have a large section describing URIs in general, the rules for formulating them, relative URI references etc, but in the latest drafts this is simplified so as not to confuse people (as much). It only really refers to source URIs rather than coordinate systems though, so I'm happy to add something. Could you please provide the wording and examples? Nobody ever seems to want to :)

With regard to the alignment command specifically, I wanted to use the URI for both the query and the content as URIs are more robust, but there was some practical reason for the existing servers that prevented us from doing so. Perhaps Rob or Andreas can comment? Again, technically it doesn't matter to the client if it has access to the coordinates XML, but it does make the spec not 'feel right' IMO. Also, if coordinate system descriptions (i.e. the comma-separated string) were to change over time, servers would drift and this would cause big problems for the client, but in truth plenty of stuff would break if that were to happen.

Cheers,
Andy

On 11 Aug 2010, at 20:14, Thomas Down wrote:

> On my reading, the current spec is a bit vague about how we should refer to
> coordinate systems.
> 
> There seem to be three ways to represent a CS:
> 
>   - Comma-separated list, e.g. NCBI_36,Chromosome,Homo sapiens
>   - URI, e.g. http://www.dasregistry.org/dasregistry/coordsys/CS_DS40
>   - XML, e.g.:
> 
>       <COORDINATES uri="http://www.dasregistry.org/dasregistry/coordsys/CS_DS40"
>                    taxid="9606" source="Chromosome" authority="NCBI"
>                    test_range="1:1,1000"
>                    version="36">NCBI_36,Chromosome,Homo sapiens</COORDINATES>
> 
> The XML representation seems to be the most complete.
> 
> The URIs don't really get discussed much in the spec.  Should they resolve
> to anything in particular?  Or should they just be treated as opaque
> strings?  The example I've given resolves to an HTML document with a
> Vitruvian Man icon and some human-readable details, but probably isn't going
> to be any help to a client.
> 
> If you restrict yourself to single-genome DAS (sequence, features, etc.),
> this all works out fine -- the only interaction you need with the coordinate
> system infrastructure is to filter out suitable sources from a registry, and
> in that case you can either filter on the XML COORDINATES elements -- which
> is fairly straightforward -- or you can ask the registry to filter for you
> (using a data model which is a reasonably close match to the XML).
> 
> However, working with coordinate systems seems to be pretty much essential
> once you start working with alignments, and this is where things start to
> get complex.
> 
> The returned alignment XML defines the CS of each sequence in the alignment
> using the comma-separated form.  My assumption is that you're meant to treat
> this as an opaque string and correlate it with data from a registry, but
> this isn't 100% clear.
> 
> On the other hand, if you want to specify a coordinate system in the
> alignment QUERY, you're supposed to provide a URI.  It's not at all clear to
> me what a server is supposed to be doing with this.  Again, opaque string?
> 
> Is it too late to ask if there's any chance of rationalizing this (and maybe
> providing a few concrete examples in the spec) before 1.6-final?
> 
>             Thomas.
> _______________________________________________
> DAS mailing list
> DAS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das




