[DAS] [Fwd: Re: Writeback implementation]
Andy Jenkinson
andy.jenkinson at ebi.ac.uk
Wed Oct 29 22:15:08 UTC 2008
Gregg Helt wrote:
>
> Sorry for being imprecise about URIs, what I meant to say was that every
> feature in DAS/2.0 has a unique _absolute_ URI. Most IDs can be treated
> as relative URIs but not absolute URIs, and referring to relative URIs
> is not particularly useful outside their context.
By relative URI do you mean a URN (e.g. SO:12345), as opposed to the HTML
sense of a relative link (e.g. index.html)? URNs are still useful since they allow us
to solve this issue of identifying things that are the same. A
resolvable URI (i.e. a URL) is undoubtedly "better", but this is
semantic web territory and I'm not convinced it is necessary for DAS.
Certainly I think it would be too much of a constraint to layer onto the
existing spec in one increment. In fact even using URNs is not easy for
everything - segment IDs cannot have colons.
> Furthermore technically not all arbitrary ID strings can actually be
> relative URIs either. I thought this was mostly a theoretical issue
> until my Trellis/Ivy DAS1-->DAS2 proxy choked on such a case on only the
> third DAS1 data source I was testing,
> http://www.ebi.ac.uk/das-srv/genomicdas/das/batman_CD4. It returns
> features that derive their IDs from their genomic location, like
> "21:26029715,26029814". Which can't be any form of URI, because
> according to the URI syntax spec <http://tools.ietf.org/html/rfc3986>
> the appearance of the colon before any forward slash means the "21"
> should be treated as the URI scheme, but the scheme can't have a digit
> as the first character. This isn't just a rare instance either -- I
> count at least sixteen data sources like this (probably more) on
> ProServer servers for the latest human genome assembly alone.
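For illustration, the RFC 3986 rule described in the quoted text can be sketched roughly as follows. This is only a minimal check of the "colon in the first segment" rule, not a full URI validator, and the function name first_segment_ok is made up for this example. It treats everything before the first "/", "?" or "#" as the first segment; if that segment contains a colon, the text before the colon has to be a legal scheme (a letter followed by letters, digits, "+", "-" or ".").

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def first_segment_ok(identifier: str) -> bool:
    """Check only the RFC 3986 rule about a colon in the first segment.

    Hypothetical helper for illustration; it does not validate the
    rest of the string.
    """
    # The first segment ends at the first "/", "?" or "#".
    head = re.split(r"[/?#]", identifier, maxsplit=1)[0]
    if ":" not in head:
        # No colon before any slash: fine as a relative reference
        # as far as this particular rule is concerned.
        return True
    # A colon in the first segment forces the text before it to be
    # read as a scheme, so it must match the scheme grammar.
    scheme = head.split(":", 1)[0]
    return bool(SCHEME.match(scheme))


for candidate in ("21:26029715,26029814", "SO:12345", "index.html",
                  "./21:26029715,26029814"):
    print(candidate, first_segment_ok(candidate))
```

RFC 3986 (section 4.2) itself suggests the "./" prefix shown in the last candidate as a way to use a colon-containing first segment in a relative reference; percent-encoding the colon would be another option.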
In this case, the ID is the least verbose but still unique-to-the-server
ID possible, used because the annotation has no natural identifier (the
source has per-base annotations). Believe me, there are far worse
implementations - some servers don't even try to generate a unique ID
for this kind of data. Leaving it blank is something that can be
rejected in validation, but it's very difficult to verify it's actually
unique...
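As a rough sketch of what a validator can and cannot see: the snippet below rejects blank feature IDs and flags IDs repeated within a single features response. The sample document is invented for this example (it only assumes FEATURE elements carrying an id attribute) and check_feature_ids is a hypothetical name. Uniqueness across the whole server cannot be checked this way, since that would mean seeing every feature the server can return.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented, heavily trimmed response used only for this illustration.
SAMPLE = """<DASGFF><GFF><SEGMENT id="21" start="26029700" stop="26029900">
<FEATURE id="21:26029715,26029814"/>
<FEATURE id=""/>
<FEATURE id="21:26029715,26029814"/>
</SEGMENT></GFF></DASGFF>"""


def check_feature_ids(xml_text):
    """Return (number of blank IDs, sorted list of IDs seen more than once).

    Only covers one response; it says nothing about server-wide uniqueness.
    """
    root = ET.fromstring(xml_text)
    ids = [feature.get("id", "") for feature in root.iter("FEATURE")]
    blank = sum(1 for i in ids if not i.strip())
    counts = Counter(i for i in ids if i.strip())
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return blank, duplicates


blank, duplicates = check_feature_ids(SAMPLE)
print("blank IDs:", blank)
print("repeated IDs:", duplicates)
```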
There is nothing wrong with this particular example w.r.t. the 1.53 spec,
since the spec says nothing about IDs having to be URIs, it simply says
they must uniquely identify the feature on the server. But you have hit
upon one of the reasons _resolvable_ URIs (i.e. URLs) will be difficult
to implement - annotations that have no natural identifier such as those
in the batman_CD4 source. Plus, having a unique identifier for every
base in a genome for every experiment it appears in is always going to
be verbose.
> On a side
> note, I'm not sure if these IDs are legal DAS1.53 feature IDs either,
> since many of them will not be unique within their DAS server, and
> depending on how you interpret the 1.53 spec the colon may not be a legal
> ID character.
I don't think there's a problem with the colon - this is an illegal
character for reference IDs but not for feature IDs as far as I can see.
> The Trellis/Ivy proxy now deals with these cases, but checking each ID
> to see if it's a legal URI, and figuring out what to do if it's not, is
> definitely adding some performance overhead to the proxy.
>
> This also points to the need for better validation of server responses,
> preferably as enhancements to the validation that the DAS1 registry
> already does. I doubt if the current DAS2 validator would catch these
> kinds of things either.
If you can give specific examples of things that could be targets for
validation, I believe Jonathan will add them to his list so he can
implement them... :)