[DAS2] query language description

Andrew Dalke dalke at dalkescientific.com
Thu Mar 16 05:17:24 UTC 2006


The query fields are

   name      | takes  | matches features ...
   =========================================================================
   xid       | URI    | which have the given xid
   type      | URI    | with the given type or subtype (XX keep this one???)
   exacttype | URI    | with exactly the given type
   segment   | URI    | on the given segment
   overlaps  | region | which overlap the given region
   inside    | region | which are contained inside the given region (XX needed??)
   contains  | region | which contain the given region (XX needed??)
   name      | string | with a name or alias which matches the given string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests.  For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment.

The query is for
     segment = 'http://ncbi.org/human/Chr1'
     overlaps = 0:10000

which is form-urlencoded as

    
    http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
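
Here is a rough sketch of how a client might build such a URL in
Python.  The helper name build_query is made up for illustration, and
the ';' separator is simply copied from the examples in this mail.

```python
from urllib.parse import quote

def build_query(base_url, terms):
    """Join (key, value) pairs into a ';'-separated, percent-encoded query.

    build_query is a hypothetical helper, not part of any DAS2 library.
    """
    # safe="" makes quote() escape '/' and ':' as well.
    encoded = [quote(key, safe="") + "=" + quote(value, safe="")
               for key, value in terms]
    return base_url + "?" + ";".join(encoded)

url = build_query("http://biodas.org/features",
                  [("segment", "http://ncbi.org/human/Chr1"),
                   ("overlaps", "0:10000")])
print(url)
```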

Multiple search terms with the same key are OR'ed together.  The
following searches for features which have a name or alias of either
BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys.  The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

    
    http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect
the results.
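
The OR-within-a-key, AND-across-keys rule can be sketched like this.
It's only an illustration: the feature is a made-up dict mapping each
query key to the set of values the feature has, and matching is by
exact equality (no wildcards, no range logic).

```python
from collections import defaultdict

def group_terms(terms):
    """Group (key, value) pairs by key; the order of terms is irrelevant."""
    grouped = defaultdict(set)
    for key, value in terms:
        grouped[key].add(value)
    return dict(grouped)

def matches(feature, terms):
    """AND across keys, OR across the values given for one key.

    `feature` is a hypothetical dict: query key -> set of values.
    """
    return all(feature.get(key, set()) & values
               for key, values in group_terms(terms).items())

feature = {"name": {"BC048328"},
           "segment": {"http://ncbi.org/human/Chr1"}}
query = [("name", "BC048328"),
         ("segment", "http://ncbi.org/human/Chr1"),
         ("name", "BC015400")]
print(matches(feature, query))
print(matches(feature, list(reversed(query))))
```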

If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned.  (XXX Gregg -- is this correct? XXX)


The fields which take URIs require exact matches.

I think we decided that there is no type inferencing done in
the server; it's a client side thing.  In that case the 'type'
field goes away.  We can still keep 'exacttype'.  The URI
used for the matching is the type uri, and NOT the ontology URI.

(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)

The segment URI must accept the local identifier.  For
interoperability with other servers they must also accept the
equivalent global identifier, if there is one.

If range searches are given then one and only one segment is
allowed.  Multiple segments may be given, but then ranges are not
allowed.
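
A server could check that rule up front.  This is a minimal sketch; the
function name is invented, and I'm assuming the range keys are the three
region fields from the table (overlaps, inside, contains).

```python
# Assumption: these are the query keys that count as "range searches".
RANGE_KEYS = {"overlaps", "inside", "contains"}

def check_segment_rule(terms):
    """Raise ValueError if range terms appear without exactly one segment.

    `terms` is a list of (key, value) pairs from the query string.
    """
    segments = [value for key, value in terms if key == "segment"]
    has_range = any(key in RANGE_KEYS for key, _ in terms)
    if has_range and len(segments) != 1:
        raise ValueError(
            "range searches need exactly one segment, got %d" % len(segments))

# One segment plus a range: accepted.
check_segment_rule([("segment", "http://ncbi.org/human/Chr1"),
                    ("overlaps", "0:10000")])
```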

The string searches support a simple search language.
     ABC  -- contains a word which exactly matches "ABC" (identity, not substring)
    *ABC  -- words ending in "ABC"
     ABC* -- words starting with "ABC"
    *ABC* -- words containing the substring "ABC"

If you want a field which exactly contains a '*' you're kinda
out of luck.  The interpretation of whitespace in the query or
in the search string is implementation dependent.  For that
matter, the meaning of "word" is implementation dependent.  (Is
*O'Malley* one word? *Lethbridge-Stewart*?)
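
To make the four pattern forms concrete, here's a small sketch.  Since
"word" is implementation dependent, I've assumed words are split on
whitespace and that matching is case-sensitive; both are my choices for
the example, not something decided above.

```python
def word_matches(pattern, word):
    """Match one word against ABC, *ABC, ABC* or *ABC*."""
    leading = pattern.startswith("*")
    trailing = len(pattern) > 1 and pattern.endswith("*")
    core = pattern[1:] if leading else pattern
    core = core[:-1] if trailing else core
    if leading and trailing:
        return core in word            # *ABC*: substring
    if leading:
        return word.endswith(core)     # *ABC: suffix
    if trailing:
        return word.startswith(core)   # ABC*: prefix
    return word == core                # ABC: identity

def text_matches(pattern, text):
    """True if any whitespace-separated word of `text` matches."""
    return any(word_matches(pattern, word) for word in text.split())

print(text_matches("ABC", "xx ABC yy"))
print(text_matches("ABC", "ABCD"))
```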

When we looked into this last month at Sanger we verified that
all the databases could handle %substring% searches, which was
all that people there wanted.  The Affy people want searches for
exact word, prefix and suffix matches, as supported by the
back-end databases.


   XXX CORRECT ME XXX

The 'name' search searches.... It used to search the 'name'
attribute and the 'alias' fields.  There is no 'name' now.  I
moved it to 'title'.  I think I did the wrong thing; it should
be 'name', but it's a name meant for people, not computers.

Some features (sub-parts) don't have human-readable names so
this field must be optional.


The "prop-*" is a search of the <PROP> elements.  Features may
have properties, like

    <PROP key="cellular_component" value="membrane" />

To do a string search for all 'membrane' cellular components,
construct the query key by taking  the string "prop-" and
appending the property key text ("cellular_component").  The
query value is the text to search for.

     prop-cellular_component=membrane

To search for any cellular_component containing the substring "mem"

     prop-cellular_component=*mem*

The rules for multiple searches with the same key also apply to the
prop-* searches.  To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as

      
    http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear
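
As a sketch of the key construction, this turns a <PROP> element into a
query term.  The prop_term helper is hypothetical, and the values are
left un-encoded here to keep the example short.

```python
import xml.etree.ElementTree as ET

def prop_term(prop_xml, pattern=None):
    """Return the (query key, query value) pair for one <PROP> element.

    The key is "prop-" plus the element's key attribute.  The value is
    `pattern` if given, otherwise the element's own value attribute.
    """
    elem = ET.fromstring(prop_xml)
    key = "prop-" + elem.get("key")
    value = pattern if pattern is not None else elem.get("value")
    return key, value

prop = '<PROP key="cellular_component" value="membrane" />'
print(prop_term(prop))
print(prop_term(prop, "*mem*"))
```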


The range searches are defined with explicit start and end
coordinates.  The range syntax is in the form "start:end", for
example, "1:9".

Let 'min' be the smallest coordinate for a feature on a given
segment and 'max' be one larger than the largest coordinate.
These are the lower and upper bounds for the feature.

An 'overlaps' search matches if and only if
    min < end AND max > start
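
In code the test is a one-liner.  This sketch assumes integer
coordinates and that the range arrives as the "start:end" string; the
function names are mine.

```python
def parse_range(text):
    """Split a "start:end" range string into two integers."""
    start, end = text.split(":")
    return int(start), int(end)

def overlaps(feature_min, feature_max, query_range):
    """Apply the rule: min < end AND max > start.

    feature_max is one larger than the feature's largest coordinate.
    """
    start, end = parse_range(query_range)
    return feature_min < end and feature_max > start

print(overlaps(5, 10, "1:9"))
print(overlaps(9, 12, "1:9"))
```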

XXX For GREGG XXX

What do 'inside' and 'contains' do?  Can't we just get
away with 'excludes', which is the complement of 'overlaps'?
Searches are done as:
   Step 0) specify the segment
   Step 1) do all the includes  (if none, match all features on segment)
   Step 2) do all the excludes, inverted (like an includes search)
   Step 3) only return features which are in Step 1 but not
       in Step 2)
   Step 4) ...
   Step 5) Profit!
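
Steps 0 to 3 could look something like the sketch below.  I'm assuming
that the "includes" are the overlaps regions, that several regions of
one kind are OR'ed together, and that features are (id, min, max)
tuples on the already chosen segment.  None of that is settled.

```python
def search(features, includes, excludes):
    """Return ids of features hit by the includes but not by the excludes.

    features: list of (feature_id, min, max) on one segment (Step 0)
    includes, excludes: lists of (start, end) regions
    """
    def hits(regions):
        # Same overlap rule as above: min < end AND max > start.
        return {fid for fid, fmin, fmax in features
                if any(fmin < end and fmax > start for start, end in regions)}

    # Step 1: with no includes, every feature on the segment matches.
    selected = hits(includes) if includes else {fid for fid, _, _ in features}
    # Steps 2 and 3: run the excludes like an includes search, then subtract.
    return selected - hits(excludes)

features = [("a", 0, 5), ("b", 4, 10), ("c", 20, 30)]
print(sorted(search(features, [(0, 10)], [(0, 4)])))
```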

I think this will support your smart code, and it's easy
enough to implement.

Everyone but you was planning to use 'overlaps'.  Only you
wanted to use 'inside'.  Anyone want to use 'contains'?

					Andrew
					dalke at dalkescientific.com



