[DAS] Coordinates and sequence URIs

Andy Jenkinson andy.jenkinson at ebi.ac.uk
Thu Nov 13 12:29:58 UTC 2008


Gregg Helt wrote:
> I think there are better ways to assign URIs than
> either the way currently used in the DAS1 registry (very opaque) or the DAS2
> GlobalSeqIDs (transparent but encroaching on NCBI namespace), but the more
> important point is that we should only have one strategy for all versions of
> DAS.

Currently DAS1 does not formally include URIs; should it do so, we can 
improve how the registry handles them.

> 
> Attempts to infer sequence URIs currently lead to all sorts of trouble, as
> I've found in working on the Trellis/Ivy DAS1-->DAS2 proxy.  For example the
> proxy assumes that if versioned source V1 has coordinates C and entry_points
> capability E1, then E1 describes the segments available for coordinates C.
> Based on this assumption if versioned source V2 also has coordinates C but
> doesn't have an entry_points capability then the proxy uses E1 from V1
> instead since the versioned sources share the same coordinates.  Which
> sometimes works but not always -- what happens if versioned source V3 also
> has coordinates C but has an entry_points capability E3 that disagrees with
> E1?
> 
> I'm seeing the above situation in the DAS1 registry -- for example, for
> coordinates .../CS_DS40 (NCBI human genome assembly v.36) which has 44
> different versioned sources in the registry.  2 of these versioned sources
> have entry_points capabilities:
>     A) http://hgwdev-gencode.cse.ucsc.edu/cgi-bin/das/hg18/entry_points
>     B) http://www.snpbox.org/cgi-box/das/SNPbox_human_44_36f/entry_points
> However, these entry_points queries don't return the same thing.  They agree
> on naming for chromosome IDs, but for non-chromosomal sequences the naming
> starts varying, for instance "M" vs "MT" for the mitochondrial DNA.  More
> importantly, they disagree on the stop/length value for nearly every
> chromosome!
> 
> So I think the sequence URIs should be specified -- given the coordinate
> URIs and capability URIs of a versioned source, there should be a query
> mechanism to return sequence info for the coordinate URI and this info
> should include sequence URIs.  As illustrated above both the DAS1
> entry_points and DAS2 segments queries currently seem too disconnected from
> the coordinates URIs without some changes to the sources XML.  One would be
> to add to the entry_points and/or segments capabilities of "authoritative"
> versioned sources a coordinates attribute which would be a relative URI
> reference to the coordinates for which they are the authoritative list of
> sequences.  This is actually in the RelaxNG schema for DAS2, but currently
> commented out.

Merging sequence info with sequence URIs won't work for UniProt; it's 
just too big.

We need either to make one source authoritative for a coordinate system 
(declared in the sources or coordsys documents) or to have the registry 
validate coordinate system compliance. I'd suggest the latter because it 
allows for redundancy. Either way we need to make it a requirement that a 
coordinate system has at least one server providing segments and 
sequence, which is not currently the case.
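
As a rough sketch of what registry-side compliance validation could look
like: compare the segment list a source reports against a reference list
for the same coordinate system, and flag naming and length disagreements
of the kind Gregg describes ("M" vs "MT", differing stop values). The
function name, the report structure and the segment lengths below are all
made up for illustration; nothing here is existing registry code, and
fetching and parsing the entry_points responses is left out.

```python
from typing import Dict, List, NamedTuple, Tuple


class ComplianceReport(NamedTuple):
    """Result of comparing a candidate source against a reference."""
    missing: List[str]      # segment IDs in the reference but not the candidate
    unexpected: List[str]   # segment IDs in the candidate but not the reference
    length_mismatch: List[Tuple[str, int, int]]  # (id, reference stop, candidate stop)

    @property
    def compliant(self) -> bool:
        return not (self.missing or self.unexpected or self.length_mismatch)


def check_compliance(reference: Dict[str, int],
                     candidate: Dict[str, int]) -> ComplianceReport:
    """Compare two mappings of segment ID -> stop (length) value."""
    ref_ids, cand_ids = set(reference), set(candidate)
    missing = sorted(ref_ids - cand_ids)
    unexpected = sorted(cand_ids - ref_ids)
    mismatch = sorted(
        (seg, reference[seg], candidate[seg])
        for seg in ref_ids & cand_ids
        if reference[seg] != candidate[seg]
    )
    return ComplianceReport(missing, unexpected, mismatch)


# Hypothetical segment lengths, NOT real assembly values.
reference = {"1": 1000, "2": 900, "MT": 16}
candidate = {"1": 1000, "2": 905, "M": 16}

report = check_compliance(reference, candidate)
print(report.compliant)        # False
print(report.missing)          # ['MT']
print(report.unexpected)       # ['M']
print(report.length_mismatch)  # [('2', 900, 905)]
```

Whether a missing segment should count as a failure is a policy question:
a source that only annotates a subset of segments may still be acceptable,
whereas a length disagreement on a shared segment probably is not.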


