[DAS2] query language description
Andrew Dalke
dalke at dalkescientific.com
Thu Mar 16 05:17:24 UTC 2006
The query fields are
name      | takes  | matches features ...
===================================================================
xid       | URI    | which have the given xid
type      | URI    | with the given type or subtype (XX keep this one???)
exacttype | URI    | with exactly the given type
segment   | URI    | on the given segment
overlaps  | region | which overlap the given region
inside    | region | which are contained inside the given region (XX needed??)
contains  | region | which contain the given region (XX needed??)
name      | string | with a name or alias which matches the given string
prop-*    | string | with the property "*" matching the given string
Queries are form-urlencoded requests. For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment.
The query is for
segment = 'http://ncbi.org/human/Chr1'
overlaps = 0:10000
which is form-urlencoded as
http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
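As a minimal sketch of that encoding step, assuming Python on the client side: the helper name build_query_url is made up for illustration and is not part of DAS2. Note that the terms are joined with ';' rather than '&', so the standard urlencode() helper is not used here.

```python
from urllib.parse import quote

def build_query_url(base_url, terms):
    """Percent-encode each key and value, then join the terms with ';'.

    `terms` is a list of (key, value) pairs. This helper is hypothetical;
    it only illustrates the encoding described in the text.
    """
    encoded = [f"{quote(key, safe='')}={quote(value, safe='')}"
               for key, value in terms]
    return base_url + "?" + ";".join(encoded)

url = build_query_url(
    "http://biodas.org/features",
    [("segment", "http://ncbi.org/human/Chr1"), ("overlaps", "0:10000")],
)
print(url)
```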
Multiple search terms with the same key are OR'ed together. The
following searches for features with a name or alias of either
BC048328 or BC015400:
http://biodas.org/features?name=BC048328;name=BC015400
Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys. The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1
http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
The order of the search terms in the query string does not affect
the results.
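The OR-within-a-key, AND-across-keys rule can be sketched as below. This is only an illustration under assumptions: the feature is a plain dict of lists, and the exact() matcher stands in for whatever per-field matching a real server does. Grouping the terms by key first is what makes the order of terms irrelevant.

```python
from collections import defaultdict

def group_terms(terms):
    """Collect the values for each key: {key: {value, ...}}."""
    grouped = defaultdict(set)
    for key, value in terms:
        grouped[key].add(value)
    return dict(grouped)

def matches(feature, terms, term_matches):
    """AND across keys, OR across the values given for one key."""
    grouped = group_terms(terms)
    return all(
        any(term_matches(feature, key, value) for value in values)
        for key, values in grouped.items()
    )

def exact(feature, key, value):
    """Hypothetical matcher: the feature lists the value under that key."""
    return value in feature.get(key, ())

# Hypothetical feature record, for illustration only.
feature = {"name": ["BC048328"], "segment": ["http://ncbi.org/human/Chr1"]}
terms = [
    ("name", "BC048328"),
    ("segment", "http://ncbi.org/human/Chr1"),
    ("name", "BC015400"),
]
print(matches(feature, terms, exact))
```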
If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned. (XXX Gregg -- is this correct? XXX)
The fields which take URIs require exact matches.
I think we decided that there is no type inferencing done in
the server; it's a client side thing. In that case the 'type'
field goes away. We can still keep 'exacttype'. The URI
used for the matching is the type URI, and NOT the ontology URI.
(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)
Servers must accept the local identifier in the 'segment' field. For
interoperability with other servers they must also accept the
equivalent global identifier, if there is one.
If range searches are given then one and only one segment is
allowed. Multiple segments may be given, but then ranges are not
allowed.
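A server could check that constraint before running the query. The sketch below assumes the range fields are 'overlaps', 'inside' and 'contains' from the table, and reads "one and only one segment" as exactly one; the function name is made up.

```python
RANGE_KEYS = {"overlaps", "inside", "contains"}

def check_segment_rule(terms):
    """Raise ValueError if range terms are combined with anything other
    than exactly one segment term. `terms` is a list of (key, value)."""
    segments = [value for key, value in terms if key == "segment"]
    has_range = any(key in RANGE_KEYS for key, _ in terms)
    if has_range and len(segments) != 1:
        raise ValueError(
            f"range searches need exactly one segment, got {len(segments)}"
        )

# Allowed: several segments, no range.
check_segment_rule([("segment", "Chr1"), ("segment", "Chr2")])
# Allowed: one segment with a range.
check_segment_rule([("segment", "Chr1"), ("overlaps", "0:10000")])
```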
The string searches support a simple search language.
ABC -- contains a word which exactly matches "ABC" (identity, not
substring)
*ABC -- words ending in "ABC"
ABC* -- words starting with "ABC"
*ABC* -- words containing the substring "ABC"
If you want a field which exactly contains a '*' you're kinda
out of luck. The interpretation of whitespace in the query or
in the search string is implementation dependent. For that
matter, the meaning of "word" is implementation dependent. (Is
*O'Malley* one word? *Lethbridge-Stewart*?)
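Here is one possible reading of the four pattern forms, as a sketch only. It assumes a "word" is a whitespace-separated token and that matching is case-sensitive; both are exactly the implementation-dependent choices mentioned above, so a real server may differ.

```python
def word_matches(pattern, word):
    """Match one word against ABC, *ABC, ABC* or *ABC*."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern
    if leading:
        core = core[1:]
    if trailing:
        core = core[:-1]
    if leading and trailing:
        return core in word           # *ABC*: substring
    if leading:
        return word.endswith(core)    # *ABC: suffix
    if trailing:
        return word.startswith(core)  # ABC*: prefix
    return word == core               # ABC: whole word

def text_matches(pattern, text):
    """True if any whitespace-separated word in `text` matches."""
    return any(word_matches(pattern, word) for word in text.split())

print(text_matches("mem*", "integral membrane protein"))
```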
When we looked into this last month at Sanger we verified that
all the databases could handle %substring% searches, which was
all that people there wanted. The Affy people want searches for
exact word, prefix and suffix matches, as supported by the
back-end databases.
XXX CORRECT ME XXX
The 'name' search searches.... It used to search the 'name'
attribute and the 'alias' fields. There is no 'name' now. I
moved it to 'title'. I think I did the wrong thing; it should
be 'name', but it's a name meant for people, not computers.
Some features (sub-parts) don't have human-readable names so
this field must be optional.
The "prop-*" is a search of the <PROP> elements. Features may
have properties, like
<PROP key="cellular_component" value="membrane" />
To do a string search for all 'membrane' cellular components,
construct the query key by taking the string "prop-" and
appending the property key text ("cellular_component"). The
query value is the text to search for.
prop-cellular_component=membrane
To search for any cellular_component containing the substring "mem":
prop-cellular_component=*mem*
The rules for multiple searches with the same key also apply to the
prop-* searches. To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as
http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear
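Building the prop-* terms is just string concatenation plus encoding. A small sketch, assuming Python; prop_term is a made-up helper, and leaving '*' unencoded is an assumption based on the examples in this message.

```python
from urllib.parse import quote

def prop_term(prop_key, pattern):
    """Build one 'prop-<key>=<pattern>' query term.

    '*' is kept as-is so the search wildcards stay readable.
    """
    key = "prop-" + prop_key
    return f"{quote(key, safe='')}={quote(pattern, safe='*')}"

# Two terms with the same key are OR'ed: membrane or nuclear.
query = ";".join([
    prop_term("cellular_component", "membrane"),
    prop_term("cellular_component", "nuclear"),
])
print("http://biodas.org/features?" + query)
```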
The range searches are defined with explicit start and end
coordinates. The range syntax is in the form "start:end", for
example, "1:9".
Let 'min' be the smallest coordinate for a feature on a given
segment and 'max' be one larger than the largest coordinate.
These are the lower and upper bounds for the feature.
An 'overlaps' search matches if and only if
min < end AND max > start
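The test above treats both the feature and the query as half-open intervals. A minimal sketch, assuming integer coordinates and the "start:end" syntax; the function names are illustrative.

```python
def parse_range(text):
    """Parse a 'start:end' string into a pair of integers."""
    start, end = text.split(":")
    return int(start), int(end)

def overlaps(feature_min, feature_max, start, end):
    """feature_max is one larger than the feature's largest coordinate."""
    return feature_min < end and feature_max > start

start, end = parse_range("1:9")
print(overlaps(5, 15, start, end))   # feature covers coordinates 5..14
print(overlaps(9, 12, start, end))   # feature begins at 9, where the range ends
```

With these definitions a feature that starts exactly at 'end' does not overlap, since min < end fails.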
XXX For Gregg XXX
What do 'inside' and 'contains' do? Can't we just get
away with 'excludes', which is the complement of 'overlaps'?
Searches are done as:
Step 0) specify the segment
Step 1) do all the includes (if none, match all features on segment)
Step 2) do all the excludes, inverted (like an includes search)
Step 3) only return features which are in Step 1 but not
in Step 2)
Step 4) ...
Step 5) Profit!
I think this will support your smart code, and it's easy
enough to implement.
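To show how little code Steps 0 to 3 need, here is a sketch. Everything in it is an assumption for illustration: 'excludes' is only a proposal, features are plain dicts with made-up 'id', 'segment', 'min' and 'max' keys, and regions are (start, end) pairs.

```python
def overlaps(feature, region):
    """Half-open overlap test between a feature and a (start, end) region."""
    start, end = region
    return feature["min"] < end and feature["max"] > start

def search(features, segment, includes, excludes):
    # Step 0: restrict to the segment.
    on_segment = [f for f in features if f["segment"] == segment]
    # Step 1: apply the includes; with none given, keep everything.
    if includes:
        included = [f for f in on_segment
                    if any(overlaps(f, r) for r in includes)]
    else:
        included = on_segment
    # Step 2: run the excludes as if they were includes.
    excluded_ids = {f["id"] for f in on_segment
                    if any(overlaps(f, r) for r in excludes)}
    # Step 3: keep what is in Step 1 but not in Step 2.
    return [f for f in included if f["id"] not in excluded_ids]

features = [
    {"id": "f1", "segment": "Chr1", "min": 0, "max": 50},
    {"id": "f2", "segment": "Chr1", "min": 100, "max": 200},
    {"id": "f3", "segment": "Chr2", "min": 0, "max": 50},
]
print([f["id"] for f in search(features, "Chr1", [], [(0, 60)])])
```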
Everyone but you was planning to use 'overlaps'. Only you
wanted to use 'inside'. Anyone want to use 'contains'?
Andrew
dalke at dalkescientific.com