[DAS2] feature filter examples
Andrew Dalke
dalke at dalkescientific.com
Thu Apr 7 05:53:37 UTC 2005
Ed:
> I'm glad someone is paying attention to these details.
That's what I'm paid for. :)
> Please also keep in mind making the parsing of the parameters
> straightforward. There should be some simple deterministic
> algorithm like, for example, this:
One concern I have is the mix of two different levels of URL
escaping. Suppose I searched for the "name" "Dalke,Andrew".
To make it fit in with at least Python's standard CGI GET
query parser it would need to look like this:
name=Dalke%252CAndrew
>>> import cgi
>>> cgi.parse_qs("name=Dalke%252CAndrew")
{'name': ['Dalke%2CAndrew']}
>>>
There is a double escape of the "," because the normal
cgi processing rules allow anything to be escaped,
even the special characters of ';' and ','. If I
only had one level of escaping then it would be
processed as
>>> cgi.parse_qs("name=Dalke%2CAndrew")
{'name': ['Dalke,Andrew']}
>>>
Note that my search for the exact string "Dalke,Andrew"
has now been turned into a search for "Dalke" OR "Andrew".
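For anyone trying this on a current Python, where parse_qs lives in
urllib.parse rather than cgi, here is a minimal sketch of the same
problem. The helper name or_terms is mine, and the split on ','
stands in for the OR rule.

```python
from urllib.parse import parse_qs

def or_terms(query, field):
    """Decode the query string once, then split the field's value on ','."""
    value = parse_qs(query)[field][0]
    return value.split(",")

# One level of escaping: the comma is decoded before the split.
single = or_terms("name=Dalke%2CAndrew", "name")
# Two levels of escaping: the comma survives the split, still escaped.
double = or_terms("name=Dalke%252CAndrew", "name")

print(single)  # ['Dalke', 'Andrew']
print(double)  # ['Dalke%2CAndrew']
```

The second result still needs one more URL-decode to get back to
"Dalke,Andrew".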
I think the algorithm for parsing this string is as
follows.
  1. URL-decode the query string
  2. Split the string at ';'. These sub terms will be
     ANDed together.
  3. For each subterm, find the category (the part before
     the '=' sign) and the value (the part after the '=').
  4. Split the value on ','. These sub terms will be ORed
     together.
  5a. If the category is 'att' then look for the
      attribute name before the ':' and the attribute
      value after the ':'. This is a URL-encoded glob
      query.
  5b. Otherwise this is a URL-encoded query as
      appropriate for the given category.
Steps 1, 2 and 3 are done as part of any standard CGI
library.
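The steps above can be sketched roughly as follows. This is my
reading of the algorithm, not anything from the spec: the function
name parse_filter and the shape of the returned list are made up for
illustration, and I do the first-level decode by hand instead of
leaning on a CGI library.

```python
from urllib.parse import unquote, unquote_plus

def parse_filter(raw_query):
    """Return a list of (category, alternatives) pairs.

    The pairs are ANDed together; the alternatives inside one pair are ORed.
    """
    decoded = unquote_plus(raw_query)                 # step 1
    clauses = []
    for subterm in decoded.split(";"):                # step 2: AND
        if not subterm:
            continue
        category, _, value = subterm.partition("=")   # step 3
        alternatives = []
        for term in value.split(","):                 # step 4: OR
            if category == "att":                     # step 5a
                name, _, glob = term.partition(":")
                alternatives.append((unquote(name), unquote(glob)))
            else:                                     # step 5b
                alternatives.append(unquote(term))
        clauses.append((category, alternatives))
    return clauses

result = parse_filter("name=Dalke%252CAndrew;type=exon,intron;att=est_evidence:1")
print(result)
```

The doubly escaped comma in the "name" term comes through as a
single value, while "type" splits into two alternatives.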
Another option, which came up at the pre-DAS/2 meeting
at CSHL 1.5 years ago, was to use a Google (or Entrez ;)
style query, which would look like
type:exon contains:Chr1/50:99 name:"Dalke, Andrew"
or
pfam:"protein kinase" overlaps:Chr3/1000:2000:-1
Google allows a simple OR in its syntax.
pfam:"protein kinase" OR pfam:connexin inside:Chr3
The OR only applies to the two terms on either side
so this is the same as
(pfam:"protein kinase" OR pfam:connexin) AND inside:Chr3
I assume that
name:Andrew OR name:Gregg OR name:Ed AND inside:Chr3
would be the same as
((name:Andrew OR name:Gregg) OR name:Ed) AND inside:Chr3
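Here is a minimal sketch of a parser with that grouping rule,
assuming terms are always field:value, values may be double-quoted,
and a bare space (or an explicit AND) starts a new group. The token
pattern and the name parse_query are mine.

```python
import re

# Either a bare OR/AND keyword, or field:value with an optionally quoted value.
TOKEN = re.compile(r'\b(?P<op>OR|AND)\b|(?P<field>[\w-]+):(?P<value>"[^"]*"|\S+)')

def parse_query(text):
    """Return a list of OR-groups; the groups themselves are ANDed."""
    groups = []
    join_previous = False
    for match in TOKEN.finditer(text):
        if match.group("op"):
            # OR attaches the next term to the current group; AND does not.
            join_previous = match.group("op") == "OR"
            continue
        term = (match.group("field"), match.group("value").strip('"'))
        if join_previous and groups:
            groups[-1].append(term)
        else:
            groups.append([term])
        join_previous = False
    return groups

q1 = parse_query('pfam:"protein kinase" OR pfam:connexin inside:Chr3')
q2 = parse_query("name:Andrew OR name:Gregg OR name:Ed AND inside:Chr3")
q3 = parse_query("overlaps:Chr3/1000:2000:-1")
print(q1)
print(q2)
print(q3)
```

Only the first ':' separates the field from the value, so the region
syntax with its extra colons stays in one piece.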
And a recollection from logic class - all boolean queries can
be written in conjunctive normal form; as the AND of
a set of OR statements. It just might be very verbose. ;)
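With the query held as an AND of OR-groups, checking a feature is a
one-liner. This sketch assumes a feature is a plain dict and that a
term matches on exact equality; both are stand-ins for whatever a
real server would do.

```python
def matches(feature, cnf):
    """True when every OR-group has at least one term the feature satisfies."""
    return all(
        any(feature.get(field) == value for field, value in group)
        for group in cnf
    )

# (pfam:"protein kinase" OR pfam:connexin) AND inside:Chr3
cnf = [
    [("pfam", "protein kinase"), ("pfam", "connexin")],
    [("inside", "Chr3")],
]
hit = {"pfam": "connexin", "inside": "Chr3"}
miss = {"pfam": "connexin", "inside": "Chr1"}

print(matches(hit, cnf), matches(miss, cnf))  # True False
```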
Note in my example that I've eliminated the "att"
prefix, so there is nothing visual to distinguish
generic/database-defined attributes from DAS/2 required
properties. Do we need that? An advantage of keeping the
prefix is that it keeps the DAS/2 search names distinct
from properties in the database.
If that's important then I'll suggest we allow the
prefix of "att:" if people want to be sure that
there won't be a conflict, but that it's optional.
That is:
name:Andrew att:name:Andrew
selects features which have
the DAS/2 defined feature name of "Andrew" AND
the arbitrary database "name" attribute containing "Andrew"
The other possibility is to reserve "das:" for
fields defined in the spec, making this query
das:name:Andrew name:Andrew
or to make things prettier (I don't like the double ':')
das-name:Andrew name:Andrew
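To make the optional "att:" variant concrete, here is a rough sketch
of how a server might decide which namespace a term belongs to. The
set of DAS/2 field names is an assumption pulled from the examples in
this message, not from the spec, and classify is a made-up name.

```python
# Assumed set of spec-defined search fields (taken from the examples above).
DAS2_FIELDS = {"type", "name", "contains", "overlaps", "inside"}

def classify(term):
    """Return (namespace, field, value) for one field:value term."""
    field, _, value = term.partition(":")
    if field == "att":
        # Explicit prefix: always a database attribute.
        field, _, value = value.partition(":")
        return ("att", field, value)
    if field in DAS2_FIELDS:
        return ("das", field, value)
    # Unknown names fall through to database attributes.
    return ("att", field, value)

print(classify("name:Andrew"))      # ('das', 'name', 'Andrew')
print(classify("att:name:Andrew"))  # ('att', 'name', 'Andrew')
print(classify("pfam:connexin"))    # ('att', 'pfam', 'connexin')
```

The "das:" or "das-" variants would flip the default the other way
around.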
A few other things I've noticed in the query spec.
The definition for the "att" field says
Glob-style wildcards are allowed in the values.
There's an example which looks like
att=est_evidence:1 Match features with an
"est_evidence" property of "1"
Does this match "1" identically or does it also
match "100" and "91"? If the latter aren't allowed then
suppose I do
att=est_evidence:*1*
Are 100 and 91 now possible matches?
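For comparison, this is what Python's fnmatch does with those two
patterns. Whether DAS/2 "glob-style" is meant to behave the same way
is exactly the open question, so treat this as one possible reading.

```python
from fnmatch import fnmatchcase

values = ["1", "100", "91", "20"]

# A pattern with no wildcards must match the whole string.
exact = [v for v in values if fnmatchcase(v, "1")]
# '*' on both sides turns it into a substring match.
loose = [v for v in values if fnmatchcase(v, "*1*")]

print(exact)  # ['1']
print(loose)  # ['1', '100', '91']
```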
Why don't we weaken the requirements for how a server
must respond to a given query? Right now all attribute
fields must be searchable as strings by glob.
Some fields may only support searching by term,
perhaps with limited stemming. Others may be
numeric with no glob support at all.
Some implementers may even want to support range
searches for a given numeric field, perhaps like:
weight:1000..2500
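A minimal sketch of how such a range value could be handled, assuming
both ends are required and inclusive. None of this is in the spec;
parse_range and in_range are names I made up.

```python
def parse_range(text):
    """Split 'low..high' into a pair of floats."""
    low, sep, high = text.partition("..")
    if not sep:
        raise ValueError("expected low..high, got %r" % (text,))
    return float(low), float(high)

def in_range(value, text):
    """True when low <= value <= high (inclusive bounds are an assumption)."""
    low, high = parse_range(text)
    return low <= value <= high

print(parse_range("1000..2500"))     # (1000.0, 2500.0)
print(in_range(1500, "1000..2500"))  # True
print(in_range(3000, "1000..2500"))  # False
```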
BTW, why are we using "att" for this prefix instead of
"prop"? After all, aren't these the property names?
Speaking of which, given a property attribute name
like "weight" how does a client get to a description
of that property? It looks like it's magic - it's
tacked on to the base property URL so
base property URL: http://www/das/genome/volvox/1/property/
property name: weight
means the description can be found at:
http://www/das/genome/volvox/1/property/weight
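In code the "tack it on" rule is just a relative URL join. This is a
sketch of my reading; quoting the name first is my own precaution so
that a name containing ':' or '/' isn't taken for a scheme or a path.

```python
from urllib.parse import quote, urljoin

def property_url(base, name):
    """Resolve a property name against the base property URL."""
    return urljoin(base, quote(name, safe=""))

base = "http://www/das/genome/volvox/1/property/"
url = property_url(base, "weight")
print(url)  # http://www/das/genome/volvox/1/property/weight
```

The trailing '/' on the base matters: without it, the join replaces
the final "property" segment instead of appending to it.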
Going the other way around, to get a list of potential
property fields, does a client start with the PROPERTIES
request at
http://www/das/genome/volvox/1/property/
then get the fully resolved 'id' for each PROPERTY
(which may be joined with the base or may be a
full URI itself) and extract the last term in the id?
If so, that won't work with LSIDs or other opaque
identifiers. Better would be to add a "name" attribute
to the PROPERTY element so the whole thing looks like
<PROPERTIES>
<PROPERTY id="http://www/das/genome/volvox/1/property/weight"
name="weight" xs:type="xs:float"
definition="molecular weight in daltons"
/>
<PROPERTY id="urn:lsid:ibm.com:prop:pI"
name="PI" xs:type="xs:float"
definition="isoelectric point"
/>
</PROPERTIES>
and the 'name' field is what's used in the queries.
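A client could then build its lookup table without ever picking the
id apart. A minimal sketch using ElementTree; the xmlns:xs
declaration is my addition so the fragment parses on its own, and
names_to_ids is a made-up helper.

```python
import xml.etree.ElementTree as ET

# The xmlns:xs declaration is assumed; the fragment above leaves it out.
DOC = """<PROPERTIES xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
            name="weight" xs:type="xs:float"
            definition="molecular weight in daltons" />
  <PROPERTY id="urn:lsid:ibm.com:prop:pI"
            name="PI" xs:type="xs:float"
            definition="isoelectric point" />
</PROPERTIES>"""

def names_to_ids(xml_text):
    """Map each query name to its (possibly opaque) property id."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("id") for p in root.iter("PROPERTY")}

table = names_to_ids(DOC)
print(table["weight"])
print(table["PI"])
```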
This could be expanded some more to flag which fields
are searchable and to point to documentation on how to
do the search. E.g., if the "can_query" attribute exists
and is anything other than the empty string, then
searches are allowed on the given property, and the
"query_href" attribute points to documentation about
the format of that query.
<PROPERTIES>
<PROPERTY id="http://www/das/genome/volvox/1/property/weight"
name="weight" xs:type="xs:float"
definition="molecular weight in daltons"
can_query="T" query_href="http://www/help/query#mw"
/>
<PROPERTY id="urn:lsid:ibm.com:prop:pI"
name="PI" xs:type="xs:float"
definition="isoelectric point"
can_query="T" query_href="http://www/help/query#pI"
/>
</PROPERTIES>
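On the client side the "exists and is not the empty string" test is a
plain truthiness check. A sketch, with a sample document I made up to
exercise the three cases (set, empty, absent); the "charge" and
"notes" properties are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one searchable, one empty can_query, one without it.
DOC = """<PROPERTIES>
  <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
            name="weight" can_query="T"
            query_href="http://www/help/query#mw" />
  <PROPERTY id="http://www/das/genome/volvox/1/property/charge"
            name="charge" can_query="" />
  <PROPERTY id="http://www/das/genome/volvox/1/property/notes"
            name="notes" />
</PROPERTIES>"""

def searchable(xml_text):
    """Map each searchable property name to its query documentation URL."""
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("PROPERTY"):
        # get() returns None when absent; '' and None are both falsy.
        if prop.get("can_query"):
            result[prop.get("name")] = prop.get("query_href")
    return result

found = searchable(DOC)
print(found)  # {'weight': 'http://www/help/query#mw'}
```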
Andrew
dalke at dalkescientific.com