[DAS2] URIs for sequence identifiers

Mon Mar 13 23:45:04 UTC 2006

Proposals:
   - do not use segment "name" as an identifier
       - rename it "title" (human readable only)
       - allow a new optional "alias-of" attribute which is the
            link to the primary identifier for this segment

   - change the feature location to use the segment uri

   - change the feature filter range searches so there is a new "segment"
      keyword and so the "includes", "overlaps", etc. only work on
      the given segment, as
         segment=<uri>
         inside=$start:$stop
         overlaps=$start:$stop
         contains=$start:$stop
         identical=$start:$stop

   - If 'includes', 'overlaps', etc. are given then the 'segment'
       must be given (do we need this restriction?  It doesn't make
        sense to me to ask for "annotations on 1000 to 2000 of anything")

   - only allow at most one each of inside, overlaps,
       contains, or identical (do we need this restriction?)

   - multiple segments may be given, but then range searches
       are not supported (do we need this restriction?)

Discussion:

The discussion on this side of things was based on today's phone
conference.  Andreas needs data sources to work on multiple
coordinate spaces.

To quote from Andreas:
> There are several servers that understand more than one coordinate
> system and can return the same type of data in different coordinates.  
> (depending on which type of accession code/range was used for the
> request ) E.g. there are a couple of zebrafish servers that speak
> both  in Chromosome and Scaffold coordinates. (reason perhaps
> being that zebrafish is an organism that seems to be very difficult
> to assemble ?)

The current DAS system does not support this because of how
it does segment identifiers.

The current scheme looks like this:

<!-- sources.xml -->
<SOURCES ...>
   <SOURCE ...>
    <VERSION ...>
      <COORDINATES authority="Andreas" source="Scaffold" ... />
      <COORDINATES authority="Andreas" source="Chromosome" ... />
      <CAPABILITY type="segments" query_id="http://sanger/andreas/" />
      ....

Problem #1: We need two entry points, one to view the segments
in Scaffold space, the other to view them in Chromosome space.

Solution #1 (don't like it though).
Add a "source=" attribute to the CAPABILITY and allow multiple
segments capabilities

<!-- sources.xml -->
<SOURCES ...>
  <SOURCE ...>
   <VERSION ...>
    <COORDINATES authority="Andreas" source="Scaffold" ... />
    <COORDINATES authority="Andreas" source="Chromosome" ... />
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/scaffolds.xml" source="Scaffold" />
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/chromosomes.xml" source="Chromosome" />
     ....

I don't like it because it feels like the COORDINATES and
CAPABILITY[type="segments"] fields should be merged.  Still, I'll
go with it for now.

Problem #2: feature searches return features from either namespace

Consider search for name=*ABC* (that is, "ABC" as a substring in
the "name" or "alias" fields).  Then the result might be

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" />
   </FEATURE>
</FEATURES>

Where "A" is a short-hand notation for one of the segments?
Which one?  The client goes to the segment servers:

Query: http://sanger/andreas/chromosomes.xml
Response:
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ChromosomeA" name="A" length="2000" />
</SEGMENTS>

Query: http://sanger/andreas/scaffolds.xml
Response:
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ScaffoldA" name="A" length="2000" />
</SEGMENTS>

The segment name "A" matches either ChromosomeA or ScaffoldA, and
there's no way to figure out which is correct!
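
To make the ambiguity concrete, here is a minimal Python sketch of what a
client would run into.  The two segments documents are inlined as strings
(each one holding the segment from its own coordinate space), and the
helper name segments_named is made up for this example; it is not part of
any DAS library.

```python
import xml.etree.ElementTree as ET

# Inlined copies of the two segments documents (assumed content).
SCAFFOLDS_XML = """<SEGMENTS>
  <SEGMENT id="http://whatever.com/ScaffoldA" name="A" length="2000" />
</SEGMENTS>"""

CHROMOSOMES_XML = """<SEGMENTS>
  <SEGMENT id="http://whatever.com/ChromosomeA" name="A" length="2000" />
</SEGMENTS>"""


def segments_named(name, documents):
    """Return (space, segment id) pairs for every segment with this name."""
    hits = []
    for space, text in documents.items():
        for seg in ET.fromstring(text).iter("SEGMENT"):
            if seg.get("name") == name:
                hits.append((space, seg.get("id")))
    return hits


matches = segments_named(
    "A", {"Scaffold": SCAFFOLDS_XML, "Chromosome": CHROMOSOMES_XML})
# Both documents contain a segment named "A", so the name alone
# cannot tell the client which coordinate space a feature is in.
for space, segment_id in matches:
    print(space, segment_id)
```

The lookup comes back with one hit per coordinate space, which is exactly
the problem: nothing in segment="A/100:200" says which hit is meant.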

This comes because our own naming scheme is not very good at
being globally unique.  We could fix it by also stating the
namespace in the result, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" source="Scaffold"/>
   </FEATURE>
</FEATURES>

Gregg asked "why don't we just use the URI"?

After a long discussion we decided to propose just that.
That is, get rid of the "name" attribute.  Instead, use a
"title" attribute which is human readable and an optional
"alias-of" which contains the primary identifier for
the given segment.

The alias-of value is determined by the person who
defined the COORDINATES.  It could be a URL.  It could
be a URI.  It does not need to be resolvable (though it
should be - perhaps to a human readable document?  Or to
something which lists all known aliases to it?)

That is, the segments document will look like this

Query: http://sanger/andreas/chromosomes.xml
Response:
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>

Query: http://sanger/andreas/scaffolds.xml
Response:
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
     title="Scaffold A" />
</SEGMENTS>

This has a few implications.  Feature locations must be given
with respect to the segment uri, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
   </FEATURE>
</FEATURES>

Given this segment_uri a client can figure out if it is in
Scaffold or Chromosome space because it can check all of the
URIs in each space for a match.
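
A rough Python sketch of that lookup, assuming the client has already
fetched one segments document per coordinate space.  The documents are
inlined here, and the names build_uri_index and space_of_location are
invented for the example.

```python
import xml.etree.ElementTree as ET

# One segments document per coordinate space (inlined, assumed content).
SEGMENTS_BY_SPACE = {
    "Scaffold": """<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
     title="Scaffold A" />
</SEGMENTS>""",
    "Chromosome": """<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>""",
}


def build_uri_index(segments_by_space):
    """Map every segment uri to the coordinate space that lists it."""
    index = {}
    for space, text in segments_by_space.items():
        for seg in ET.fromstring(text).iter("SEGMENT"):
            index[seg.get("uri")] = space
    return index


def space_of_location(loc_xml, index):
    """Return the space of a <LOC> element, or None if the uri is unknown."""
    loc = ET.fromstring(loc_xml)
    return index.get(loc.get("segment_uri"))


index = build_uri_index(SEGMENTS_BY_SPACE)
space = space_of_location(
    '<LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>',
    index)
print(space)
```

Because the segment uri is unique across both documents, the index has
exactly one entry per segment and the lookup is unambiguous.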

The other change is in range searches.  Consider the current
scheme, which looks like

   includes=ChrA
   includes=A/100:300

The query is of the form $ID or $ID/$start:$end.  It needs to be
changed to support URLs.  For example,

   includes={http://www.whatever.com/ChromosomeA}
   includes={http://www.whatever.com/ScaffoldA}/100:300

We couldn't come up with a better syntax.  Then Gregg asked
"why do we need multiple includes"?

That is, the current syntax supports
   includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000

to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000
bases of ChrB, or the 6th 1000 bases of ChrC".

Given the query language, we're looking for a way to write that
using URLs, as

includes={http://www.whatever.com/ChromosomeA}/0:1000;
includes={http://www.whatever.com/ChromosomeB}/2000:3000;
includes={http://www.whatever.com/ChromosomeC}/5000:6000

However, that's a very unlikely query.  What if we split the
"includes", "overlaps", etc. into "includes_segment" and
"includes_range"?  In that case:

   old-style:
includes=A/500:600
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600

   old-style:
includes=A/500:600,A/700:800
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600;
includes_range=700:800

   old-style:
includes=A/500:600,D/700:800
   new-style: -- NOT POSSIBLE

   old-style:
includes=A/500:600,D/500:600
   new-style: (not likely to be used in real life)
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_segment=http://www.whatever.com/ChromosomeD; 
includes_range=500:600;

This no longer allows searches with subranges from different segments.
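
The translation above can be sketched in Python.  This is only an
illustration: the NAME_TO_URI table and the to_new_style helper are
hypothetical, and the rule "many segments or many ranges, but not both"
is my reading of the examples rather than anything agreed on.

```python
# Hypothetical table from old-style segment names to segment URIs.
NAME_TO_URI = {
    "A": "http://www.whatever.com/ChromosomeA",
    "D": "http://www.whatever.com/ChromosomeD",
}


def to_new_style(old_value):
    """Translate the value of an old-style 'includes=' filter.

    old_value looks like 'A/500:600,A/700:800'.  Only the ranged
    $ID/$start:$end form is handled in this sketch.
    """
    segments, ranges = [], []
    for item in old_value.split(","):
        name, rng = item.split("/", 1)
        uri = NAME_TO_URI[name]
        if uri not in segments:
            segments.append(uri)
        if rng not in ranges:
            ranges.append(rng)
    # Different subranges on different segments cannot be expressed
    # once the segment and the range are separate parameters.
    if len(segments) > 1 and len(ranges) > 1:
        raise ValueError("not possible in the new style: " + old_value)
    parts = ["includes_segment=" + uri for uri in segments]
    parts += ["includes_range=" + rng for rng in ranges]
    return ";".join(parts)


print(to_new_style("A/500:600"))
print(to_new_style("A/500:600,A/700:800"))
print(to_new_style("A/500:600,D/500:600"))
```

Calling to_new_style("A/500:600,D/700:800") raises ValueError, which is
the "NOT POSSIBLE" case.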

Then again -- who cares?  Those sorts of searches are strange.

Talking some more.  Who needs the ability to do more than one
includes / overlaps / etc. query at a time?  Gregg wants the
ability to do a combination of includes and overlaps, but
that's all.

We can simplify the server code by only supporting one
inside search, one contains search, and/or one overlaps
search, instead of the current system which allows a more
constructive geometry, and we can move the segment id out
into its own parameter.

Allen said that that would prevent more complicated types
of analysis on the server, but that anyone doing more
complicated searches would pull the data down locally.

Does anyone want to do more than one overlaps search at
a time?  More than one contains search at a time?  More
than one identical search at a time?

(For that matter, does anyone actually want to do an "identical"
search?  Gregg thinks it will be useful to find any other
annotations which exactly match the given range.
I think that might be better with an "include"/"exclude" combination
to have start/end positions within a couple of bases from
the specified range.)

PROPOSAL:
   Change the range query language to have

segment= <the URL of the segment to search>
inside= $start:$end
overlaps= $start:$end
contains= $start:$end

Example:

segment=http://whatever.com/ChromosomeD;inside=5000:6000

Also, only allow at most one inside, one overlaps, and
one contains (unless people want it).  I'm less sure about
the need for this restriction.  It might be as easy to
implement the more complex search as it would be to check
for the error cases.
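
For a sense of how much work the error checking is, here is a rough
Python sketch of a server-side parser for the proposed syntax.  The
function name parse_range_query is invented, "identical" is included
because it is in the proposal list at the top, and rejecting unknown
keywords is a simplification for this example (a real feature filter
has other keywords too).

```python
RANGE_KEYS = ("inside", "overlaps", "contains", "identical")


def parse_range_query(query):
    """Parse 'segment=<uri>;inside=$start:$end;...' into (segments, ranges)."""
    segments = []
    ranges = {}
    for part in query.split(";"):
        if not part:
            continue  # tolerate a trailing ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % part)
        if key == "segment":
            segments.append(value)
        elif key in RANGE_KEYS:
            if key in ranges:
                raise ValueError("at most one %r search is allowed" % key)
            start_text, colon, end_text = value.partition(":")
            if not colon:
                raise ValueError("expected $start:$end, got %r" % value)
            ranges[key] = (int(start_text), int(end_text))
        else:
            raise ValueError("unknown keyword %r" % key)
    # A range with no segment is "1000 to 2000 of anything".
    if ranges and not segments:
        raise ValueError("a range search needs a segment")
    # Multiple segments are allowed, but then no range searches.
    if ranges and len(segments) > 1:
        raise ValueError("range searches need exactly one segment")
    return segments, ranges


segments, ranges = parse_range_query(
    "segment=http://whatever.com/ChromosomeD;inside=5000:6000")
print(segments, ranges)
```

The three restrictions come to a handful of lines, so checking for the
error cases is cheap in this form; whether the more general search is
equally cheap depends on the server's backend.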

					Andrew
					dalke at dalkescientific.com