[DAS2] Coordinates and sequence URIs
Gregg Helt
gregghelt at gmail.com
Thu Nov 13 07:39:18 UTC 2008
On Thu, Oct 30, 2008 at 2:01 PM, Garret Wilson <garret at globalmentor.com> wrote:
> ...
> This brings up a related issue regarding the assembly and sequence URIs at
> http://www.biodas.org/wiki/GlobalSeqIDs . Before on this list I've brought
> up the issue of whether DAS has authority to maintain identifiers in
> namespaces from domains controlled by third parties (i.e. NCBI). This still
> worries me.
>
> How confident can we be that the DAS GlobalSeqIDs are stable and will not
> change for a while?
The GlobalSeqIDs were created because at the time there were no stable URIs
for genome assemblies and assembly sequences from authoritative sources like
NCBI. As far as I know that's still the case, though since then there's
been some movement towards stable URIs at NCBI (see
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2007Feb/0123.html)
and other authoritative sources. Also at the time the GlobalSeqIDs were
created the DAS registry used IDs for coordinates but not URIs.
But now the DAS registry uses the DAS1.53E/2.0 "sources" document, so every
COORDINATES entry has a URI. For example:
http://www.dasregistry.org/coordsys/CS_DS40 is the registry coordinates URI
corresponding to the GlobalSeqID URI
http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/ . Given that we have a
central DAS registry I do think it makes sense that maintaining stable URIs
for sequences and assemblies (and other collections of sequences) be handled
in the registry -- at least when there are no stable URIs from an
authoritative source. I think there are better ways to assign URIs than
either the way currently used in the DAS1 registry (very opaque) or the DAS2
GlobalSeqIDs (transparent but encroaching on NCBI namespace), but the more
important point is that we should only have one strategy for all versions of
DAS.
We discussed this back in 2006/2007, and I know Andreas Prlic joined in on
several teleconference conversations about merging the DAS2 notion of global
seq and assembly IDs into the DAS registry and "sources" doc coordinates
elements.
> Secondly, related to URI resolution, I note that I cannot take an assembly
> URI such as http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/ and simply
> resolve the chromosome ID (e.g. chr1) against it to form the sequence URI.
> My application instead has to have specific knowledge of this particular
> assembly namespace, knowing that it must first append the path segment
> "dna/" to the URI, yielding
> http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/dna/chr1 .
>
> I'd rather my application, once it knew the assembly URI, simply need to
> resolve the chromosome ID to the assembly URI to determine the sequence URI,
> such as http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/chr1 .
>
> Garret
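To make the resolution behaviour concrete, here is a minimal sketch using standard relative-reference resolution (Python's urllib.parse.urljoin). The URIs are the ones quoted above; the variable names are only illustrative. It shows that resolving the bare chromosome ID gives the URI Garret would like, while the current GlobalSeqID sequence URI needs the extra "dna/" segment that a client has to know about in advance.

```python
from urllib.parse import urljoin

# Assembly URI from the GlobalSeqIDs page (note the trailing slash).
assembly_uri = "http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/"

# Plain relative resolution of the chromosome ID against the assembly URI.
resolved_directly = urljoin(assembly_uri, "chr1")

# What a client has to do today: know about the "dna/" path segment.
resolved_with_dna = urljoin(assembly_uri, "dna/chr1")

# Without the trailing slash, the last path segment of the base is replaced,
# so the assembly version is lost from the result.
resolved_no_slash = urljoin(assembly_uri.rstrip("/"), "chr1")

print(resolved_directly)
print(resolved_with_dna)
print(resolved_no_slash)
```

The third case is a reminder that any scheme relying on relative resolution also depends on the assembly URI ending in a slash.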
This illustrates one weakness of the current DAS sources XML -- given the
coordinates URI, there is still no ability to directly determine
authoritative/reference sequence URIs for those coordinates. These sequence
URIs can't be reliably inferred from the coordinates URIs, and I don't think
they should be inferred (or constructed) at all.
Attempts to infer sequence URIs currently lead to all sorts of trouble, as
I've found in working on the Trellis/Ivy DAS1-->DAS2 proxy. For example the
proxy assumes that if versioned source V1 has coordinates C and entry_points
capability E1, then E1 describes the segments available for coordinates C.
Based on this assumption if versioned source V2 also has coordinates C but
doesn't have an entry_points capability then the proxy uses E1 from V1
instead since the versioned sources share the same coordinates. This
sometimes works, but not always -- what happens if versioned source V3 also
has coordinates C but has an entry_points capability E3 that disagrees with
E1?
I'm seeing the above situation in the DAS1 registry. For example,
coordinates .../CS_DS40 (NCBI human genome assembly v.36) has 44
different versioned sources in the registry. Two of these versioned sources
have entry_points capabilities:
A) http://hgwdev-gencode.cse.ucsc.edu/cgi-bin/das/hg18/entry_points
B) http://www.snpbox.org/cgi-box/das/SNPbox_human_44_36f/entry_points
However, these entry_points queries don't return the same thing. They agree
on naming for chromosome IDs, but for non-chromosomal sequences the naming
starts varying, for instance "M" vs "MT" for the mitochondrial DNA. More
importantly, they disagree on the stop/length value for nearly every
chromosome!
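A rough sketch of how such a comparison can be made: parse two entry_points responses and report IDs present in only one, plus shared IDs whose stop values differ. The XML below is made-up sample data, not the actual output of servers A and B, and it assumes the usual DAS1 layout of SEGMENT elements carrying id, start and stop attributes. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened entry_points responses (values are invented).
RESPONSE_A = """<DASEP><ENTRY_POINTS href="http://a.example.org/das/x/entry_points" version="1.0">
  <SEGMENT id="1" start="1" stop="1000">1</SEGMENT>
  <SEGMENT id="2" start="1" stop="2000">2</SEGMENT>
  <SEGMENT id="M" start="1" stop="100">M</SEGMENT>
</ENTRY_POINTS></DASEP>"""

RESPONSE_B = """<DASEP><ENTRY_POINTS href="http://b.example.org/das/y/entry_points" version="1.0">
  <SEGMENT id="1" start="1" stop="1001">1</SEGMENT>
  <SEGMENT id="2" start="1" stop="2000">2</SEGMENT>
  <SEGMENT id="MT" start="1" stop="100">MT</SEGMENT>
</ENTRY_POINTS></DASEP>"""


def parse_entry_points(xml_text):
    """Return a dict mapping segment id -> stop coordinate."""
    root = ET.fromstring(xml_text)
    return {seg.get("id"): int(seg.get("stop")) for seg in root.iter("SEGMENT")}


def compare_entry_points(a, b):
    """Return (ids only in a, ids only in b, {shared id: (stop_a, stop_b)} where stops differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = {
        seg_id: (a[seg_id], b[seg_id])
        for seg_id in sorted(set(a) & set(b))
        if a[seg_id] != b[seg_id]
    }
    return only_a, only_b, mismatched


only_a, only_b, mismatched = compare_entry_points(
    parse_entry_points(RESPONSE_A), parse_entry_points(RESPONSE_B)
)
print("only in A:", only_a)
print("only in B:", only_b)
print("stop mismatches:", mismatched)
```

A proxy that borrows entry_points from another versioned source could run a check like this first and refuse to substitute when the two lists disagree.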
So I think the sequence URIs should be specified -- given the coordinate
URIs and capability URIs of a versioned source, there should be a query
mechanism to return sequence info for the coordinate URI and this info
should include sequence URIs. As illustrated above both the DAS1
entry_points and DAS2 segments queries currently seem too disconnected from
the coordinates URIs without some changes to the sources XML. One would be
to add to the entry_points and/or segments capabilities of "authoritative"
versioned sources a coordinates attribute which would be a relative URI
reference to the coordinates for which they are the authoritative list of
sequences. This is actually in the RelaxNG schema for DAS2, but currently
commented out.
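As a sketch of what a client could do if such an attribute existed: scan the sources document for capabilities that carry a coordinates attribute, resolve it as a relative URI reference, and return the query URIs of those that match the coordinates URI of interest. Everything here is an assumption for illustration: the simplified, namespace-free sources XML, the das.example.org server, the document base URI, and the function name. Only the coordinates attribute itself and the CS_DS40 URI come from the discussion above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed base URI of the sources document, used to resolve relative references.
DOC_BASE = "http://www.dasregistry.org/das/sources"

# Hypothetical sources document with a "coordinates" attribute on one capability.
SOURCES_XML = """<SOURCES>
  <SOURCE uri="example_source" title="Example">
    <VERSION uri="example_source/v1">
      <COORDINATES uri="http://www.dasregistry.org/coordsys/CS_DS40"/>
      <CAPABILITY type="das1:entry_points"
                  query_uri="http://das.example.org/das/hg18/entry_points"
                  coordinates="/coordsys/CS_DS40"/>
      <CAPABILITY type="das1:features"
                  query_uri="http://das.example.org/das/hg18/features"/>
    </VERSION>
  </SOURCE>
</SOURCES>"""


def authoritative_queries(xml_text, base_uri, coordinates_uri):
    """Return query URIs of capabilities declared authoritative for coordinates_uri."""
    root = ET.fromstring(xml_text)
    matches = []
    for cap in root.iter("CAPABILITY"):
        ref = cap.get("coordinates")
        if ref is None:
            continue  # capability makes no claim to be authoritative
        if urljoin(base_uri, ref) == coordinates_uri:
            matches.append(cap.get("query_uri"))
    return matches


queries = authoritative_queries(
    SOURCES_XML, DOC_BASE, "http://www.dasregistry.org/coordsys/CS_DS40"
)
print(queries)
```

With this in place a client would not need to guess which entry_points list to trust for a given coordinates URI; it would only use lists explicitly marked as authoritative.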
Gregg