[DAS2] query language description

Thu Mar 16 20:50:37 UTC 2006

Hi Andrew

I presume one constraint is that you want to preserve standard CGI URL 
syntax? I think this is the best that can be done using that 
constraint, which is to say, fairly limited. This lacks one of the most 
important features of a real query language, composability. These 
ad-hoc constraint syntaxes have their uses but you'll eventually want 
to go beyond the limits and end up adding awkward extensions. Why not 
just forgo the URL constraint and go with a composable, extensible 
query language in the first place and save a lot of bother downstream?

On Mar 15, 2006, at 9:17 PM, Andrew Dalke wrote:

> The query fields are
>
>    name      | takes  | matches features ...
>   ===========+========+=====================================================
>    xid       | URI    | which have the given xid
>    type      | URI    | with the given type or subtype (XX keep this one???)
>    exacttype | URI    | with exactly the given type
>    segment   | URI    | on the given segment
>    overlaps  | region | which overlap the given region
>    inside    | region | which are contained inside the given region (XX needed??)
>    contains  | region | which contain the given region (XX needed??)
>    name      | string | with a name or alias which matches the given string
>    prop-*    | string | with the property "*" matching the given string
>
> Queries are form-urlencoded requests.  For example, if the features
> query URL is 'http://biodas.org/features' and there is a segment named
> 'http://ncbi.org/human/Chr1' then the following is a request for all the
> features on the first 10,000 bases of that segment
>
> The query is for
>      segment = 'http://ncbi.org/human/Chr1'
>      overlaps = 0:10000
>
> which is form-urlencoded as
>
>
> http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
>
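A minimal Python sketch of the encoding step quoted above, for anyone who wants to check a URL by hand. The helper name build_query_url is made up for illustration; the only library call is urllib.parse.quote_plus, and the ';' separator is taken from the examples in the proposal.

```python
from urllib.parse import quote_plus

def build_query_url(base_url, terms):
    """Percent-encode each key and value, then join the pairs with ';'.

    build_query_url is a hypothetical helper, not part of any DAS library.
    """
    encoded = [quote_plus(key) + "=" + quote_plus(value) for key, value in terms]
    return base_url + "?" + ";".join(encoded)

url = build_query_url(
    "http://biodas.org/features",
    [("segment", "http://ncbi.org/human/Chr1"), ("overlaps", "0:10000")],
)
print(url)
# http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
```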
> Multiple search terms with the same key are OR'ed together.  The
> following searches for features with a name or alias of either
> BC048328 or BC015400
>
>    http://biodas.org/features?name=BC048328;name=BC015400
>
> Multiple search terms with different keys are AND'ed together,
> but only after doing the OR search for each set of search terms with
> identical keys.  The following searches for features which have
> a name or alias of BC048328 or BC015400 and which are on the segment
> http://ncbi.org/human/Chr1
>
>
> http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
>
> The order of the search terms in the query string does not affect
> the results.
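The combining rule above (OR within a key, AND across keys, order irrelevant) can be sketched in a few lines of Python. This is only an illustration under assumptions: features are modelled as plain dicts mapping a field name to a set of values, matching is exact, and the feature ids, the Chr2 URI and the accession XX000000 are invented test data.

```python
from collections import defaultdict
from urllib.parse import unquote_plus

def parse_terms(query_string):
    """Group the values of a ';'-separated query string by key."""
    grouped = defaultdict(list)
    for pair in query_string.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        grouped[unquote_plus(key)].append(unquote_plus(value))
    return dict(grouped)

def matches(fields, grouped):
    """AND across keys; OR across the values given for the same key."""
    return all(
        any(value in fields.get(key, ()) for value in values)
        for key, values in grouped.items()
    )

# Hypothetical features: id -> {field name: set of values}
features = {
    "f1": {"name": {"BC048328"}, "segment": {"http://ncbi.org/human/Chr1"}},
    "f2": {"name": {"BC015400"}, "segment": {"http://ncbi.org/human/Chr2"}},
    "f3": {"name": {"XX000000"}, "segment": {"http://ncbi.org/human/Chr1"}},
}

query = "name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400"
grouped = parse_terms(query)
hits = [fid for fid, fields in features.items() if matches(fields, grouped)]
print(hits)  # ['f1']
```

Because the terms are grouped by key before anything is evaluated, reordering them in the query string cannot change the result.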
>
> If any part of a complex feature (that is, one with parents
> or parts) matches a search term then all of the parents and
> parts are returned.  (XXX Gregg -- is this correct? XXX)
>
>
> The fields which take URIs require exact matches.
>
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing.  In that case the 'type'
> field goes away.  We can still keep 'exacttype'.  The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The 'segment' field must accept the local identifier.  For
> interoperability with other servers, a server must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed.  Multiple segments may be given, but then ranges are not
> allowed.
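A server-side check for the segment/range rule above might look like the following Python sketch. It assumes the query has already been grouped into a dict of key -> list of values, and that 'overlaps', 'inside' and 'contains' are the range keys; check_segment_rule and the Chr2 URI are hypothetical.

```python
RANGE_KEYS = ("overlaps", "inside", "contains")

def check_segment_rule(grouped):
    """Raise ValueError if range terms are given without exactly one segment."""
    segments = grouped.get("segment", [])
    has_range = any(grouped.get(key) for key in RANGE_KEYS)
    if has_range and len(segments) != 1:
        raise ValueError(
            "range searches need exactly one segment, got %d" % len(segments)
        )

# One segment plus a range: accepted.
check_segment_rule({"segment": ["http://ncbi.org/human/Chr1"], "overlaps": ["0:10000"]})

# Several segments and no range: accepted.
check_segment_rule(
    {"segment": ["http://ncbi.org/human/Chr1", "http://ncbi.org/human/Chr2"]}
)
```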
>
> The string searches support a simple search language.
>      ABC  -- contains a word which exactly matches "ABC" (identity, not
> substring)
>     *ABC  -- words ending in "ABC"
>      ABC* -- words starting with "ABC"
>     *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck.  The interpretation of whitespace in the query or
> in the search string is implementation dependent.  For that
> matter, the meaning of "word" is implementation dependent.  (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
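To make the four pattern forms concrete, here is a small Python sketch. Since the proposal leaves "word" and whitespace handling implementation dependent, this version simply assumes that words are whitespace-separated and that matching is case-sensitive; both are my assumptions. An asterisk anywhere other than the two ends is treated as a literal character.

```python
def word_matches(pattern, word):
    """Match one word against ABC, *ABC, ABC* or *ABC*."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern.strip("*")
    if leading and trailing:
        return core in word           # *ABC*: substring
    if leading:
        return word.endswith(core)    # *ABC: suffix
    if trailing:
        return word.startswith(core)  # ABC*: prefix
    return word == core               # ABC: whole-word identity

def text_matches(pattern, text):
    """True if any whitespace-separated word of text matches the pattern."""
    return any(word_matches(pattern, word) for word in text.split())

print(text_matches("ABC", "XABCY ABC"))  # True
print(text_matches("ABC", "XABCY"))      # False
print(text_matches("*ABC", "XABC"))      # True
print(text_matches("ABC*", "XABC"))      # False
print(text_matches("*ABC*", "XABCY"))    # True
```

Under the whitespace assumption, O'Malley and Lethbridge-Stewart each count as a single word; a different tokenizer would give different answers, which is exactly the ambiguity noted above.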
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.  The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the
> back-end databases.
>
>
>    XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields.  There is no 'name' now.  I
> moved it to 'title'.  I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
>
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
>
>
> The "prop-*" is a search of the <PROP> elements.  Features may
> have properties, like
>
>     <PROP key="cellular_component" value="membrane" />
>
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking  the string "prop-" and
> appending the property key text ("cellular_component").  The
> query value is the text to search for.
>
>      prop-cellular_component=membrane
>
> To search for any cellular_component containing the substring "mem"
>
>      prop-cellular_component=*mem*
>
> The rules for multiple searches with the same key also apply to the
> prop-* searches.  To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
>
>
> http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear
>
>
> The range searches are defined with explicit start and end
> coordinates.  The range syntax is in the form "start:end", for
> example, "1:9".
>
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper bounds for the feature.
>
> An 'overlaps' search matches if and only if
>     min < end AND max > start
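The overlap test above is a one-liner; this Python sketch spells it out together with "start:end" parsing. The function names are invented. Note that with strict inequalities the query range behaves as half-open, just like the feature's [min, max); that is my reading of the formula, not something the proposal states.

```python
def parse_range(text):
    """Split a 'start:end' string into two integers."""
    start, end = text.split(":")
    return int(start), int(end)

def overlaps(feature_min, feature_max, start, end):
    """The rule as quoted: min < end AND max > start."""
    return feature_min < end and feature_max > start

start, end = parse_range("1:9")
print(overlaps(5, 10, start, end))  # True: the feature straddles the end
print(overlaps(9, 12, start, end))  # False: the feature begins at 'end'
print(overlaps(0, 1, start, end))   # False: the feature stops at 'start'
```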
>
> XXX For GREGG XXX
>
> What do 'inside' and 'contains' do?  Can't we just get
> away with 'excludes', which is the complement of 'overlaps'?
> Searches are done as:
>    Step 0) specify the segment
>    Step 1) do all the includes  (if none, match all features on segment)
>    Step 2) do all the excludes, inverted (like an includes search)
>    Step 3) only return features which are in Step 1 but not
>        in Step 2
>    Step 4) ...
>    Step 5) Profit!
>
> I think this will support your smart code, and it's easy
> enough to implement.
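Steps 1 to 3 above amount to a set difference. A rough Python sketch, assuming Step 0 has already narrowed things down to one segment, that features are given as id -> (min, max), and that several includes (or several excludes) are OR'ed together in line with the same-key rule; the function names and data are made up.

```python
def overlaps(feature_min, feature_max, start, end):
    """Overlap rule from the proposal: min < end AND max > start."""
    return feature_min < end and feature_max > start

def search(features, includes, excludes):
    """features: id -> (min, max) on one segment; ranges are (start, end)."""
    def hit(ranges):
        return {
            fid
            for fid, (lo, hi) in features.items()
            if any(overlaps(lo, hi, start, end) for start, end in ranges)
        }

    # Step 1: all includes, or every feature on the segment if there are none
    step1 = hit(includes) if includes else set(features)
    # Step 2: the excludes, evaluated like an includes search
    step2 = hit(excludes)
    # Step 3: keep what is in Step 1 but not in Step 2
    return sorted(step1 - step2)

features = {"a": (0, 5), "b": (4, 12), "c": (20, 30)}
print(search(features, [(0, 10)], [(10, 15)]))  # ['a']
print(search(features, [], [(10, 15)]))         # ['a', 'c']
```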
>
> Everyone but you was planning to use 'overlaps'.  Only you
> wanted to use 'inside'.  Anyone want to use 'contains'?
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2