[DAS2] feature filter examples

Andrew Dalke dalke at dalkescientific.com
Thu Apr 7 05:53:37 UTC 2005


Ed:
> I'm glad someone is paying attention to these details.

That's what I'm paid for.  :)

> Please also keep in mind making the parsing of the parameters
> straightforward.  There should be some simple deterministic
> algorithm like, for example, this:

One concern I have is the mix of two different levels of URL
escaping.  Suppose I searched for the "name" "Dalke,Andrew".
To make it fit in with at least Python's standard CGI GET
query parser it would need to look like this:

   name=Dalke%252CAndrew

 >>> import cgi
 >>> cgi.parse_qs("name=Dalke%252CAndrew")
{'name': ['Dalke%2CAndrew']}
 >>>

There is a double escape of the "," because the normal
cgi processing rules allow anything to be escaped,
even the special characters of ';' and ','.  If I
only had one level of escaping then it would be
processed as

 >>> cgi.parse_qs("name=Dalke%2CAndrew")
{'name': ['Dalke,Andrew']}
 >>>

Note that my search for the exact string "Dalke,Andrew"
has now been turned into a search for "Dalke" OR "Andrew".
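
A minimal sketch of the same two cases for a current Python, assuming
urllib.parse.parse_qs as the stand-in for the cgi.parse_qs call used
above.  The helper name double_escape is made up for illustration and
is not part of any spec.

```python
from urllib.parse import parse_qs, quote

def double_escape(value):
    """Escape once for the filter syntax, then once more for the CGI layer."""
    inner = quote(value, safe="")   # "," becomes "%2C"
    return quote(inner, safe="")    # "%" becomes "%25", giving "%252C"

escaped = double_escape("Dalke,Andrew")

# The CGI layer decodes one level, so the comma is still protected.
double_level = parse_qs("name=" + escaped)

# With a single level of escaping the comma comes back as a bare ','.
single_level = parse_qs("name=" + quote("Dalke,Andrew", safe=""))
```

Only the double-escaped form leaves the filter parser able to tell a
literal comma from the OR separator.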

I think the algorithm for parsing this string is as
follows.

   1. URL-decode the query string

   2. Split the string at ';'.  These sub terms will be
      ANDed together.

     3. For each subterm, find the category (the part before
        the '=' sign) and the value (the part after the '=').

     4. Split the value on ','.  These sub terms will be ORed
        together.

         5a. If the category is 'att' then look for the
          attribute name before the ':' and the attribute
          value after the ':'.  This is a URL-encoded glob
          query.

         5b. Otherwise this is a URL-encoded query as
          appropriate for the given category.

Steps 1, 2 and 3 are done as part of any standard CGI
library.
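
The steps above can be sketched roughly as follows.  This is only one
reading of the algorithm, done by hand instead of through a CGI
library; the function name parse_filter and the shape of the returned
list are assumptions made for illustration.

```python
from urllib.parse import unquote, unquote_plus

def parse_filter(query):
    """Return a list of (category, alternatives) pairs.

    The pairs are ANDed together; the alternatives inside a pair are ORed.
    """
    decoded = unquote_plus(query)                    # step 1: first-level decode
    terms = []
    for subterm in decoded.split(";"):               # step 2: AND terms
        if not subterm:
            continue
        category, _, value = subterm.partition("=")  # step 3: category and value
        alternatives = []
        for piece in value.split(","):               # step 4: OR terms
            if category == "att":                    # step 5a: name ':' glob
                attr, _, pattern = piece.partition(":")
                alternatives.append((unquote(attr), unquote(pattern)))
            else:                                    # step 5b: category-specific
                alternatives.append(unquote(piece))  # second-level decode
        terms.append((category, alternatives))
    return terms
```

With this sketch, parse_filter("name=Dalke%252CAndrew") keeps the name
as the single string "Dalke,Andrew", while the singly escaped
"name=Dalke%2CAndrew" is split into the two alternatives "Dalke" and
"Andrew".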

Another option which came up at the pre-DAS/2 meeting
at CSHL 1.5 years ago was to use a Google (or Entrez ;)
style query, which would look like

   type:exon contains:Chr1/50:99 name:"Dalke, Andrew"

or

   pfam:"protein kinase" overlaps:Chr3/1000:2000:-1

Google allows a simple OR in its syntax.

   pfam:"protein kinase" OR pfam:connexin inside:Chr3

The OR only applies to the two terms on either side
so this is the same as

   (pfam:"protein kinase" OR pfam:connexin) AND inside:Chr3

I assume that

   name:Andrew OR name:Gregg OR name:Ed AND inside:Chr3

would be the same as

   ((name:Andrew OR name:Gregg) OR name:Ed) AND inside:Chr3

And a recollection from logic class: all boolean queries can
be written in conjunctive normal form, that is, as the AND of
a set of OR statements.  It just might be very verbose.  ;)
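
A small sketch of how such a query string might be turned into that
AND-of-ORs form.  It assumes shlex-style quoting is close enough to
the Google syntax, that OR binds only the terms on either side of it,
and that an explicit AND is just a separator; parse_query is a
made-up name.

```python
import shlex

def parse_query(query):
    """Return a list of OR-groups; the groups are ANDed together.

    Each group is a list of (field, value) pairs.
    """
    groups = []
    join_next = False
    for token in shlex.split(query):         # keeps "quoted phrases" together
        if token == "OR":
            join_next = True                 # next term joins the previous group
            continue
        if token == "AND":
            join_next = False                # explicit AND starts a new group
            continue
        field, _, value = token.partition(":")   # split at the first ':' only
        if join_next and groups:
            groups[-1].append((field, value))
        else:
            groups.append([(field, value)])
        join_next = False
    return groups
```

Splitting at the first ':' only is what lets a value such as
Chr1/50:99 keep its own colons.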


Note in my example that I've eliminated the "att"
prefix so there is nothing visual to distinguish between
generic/database-defined attributes and DAS/2 required
properties.  Do we need that?  An advantage of the prefix
is that it keeps the DAS/2 search names distinct from
properties in the database.

If that's important then I'll suggest we allow the
prefix of "att:" if people want to be sure that
there won't be a conflict, but that it's optional.

That is:

   name:Andrew att:name:Andrew

selects features which have
    the DAS/2 defined feature name of "Andrew" AND
    the arbitrary database "name" attribute containing "Andrew"

The other possibility is to reserve "das:" for
fields defined in the spec, making this query

   das:name:Andrew name:Andrew

or to make things prettier (I don't like the double ':')

   das-name:Andrew name:Andrew



A few other things I've noticed in the query spec.

The definition for the "att" field says

     Glob-style wildcards are allowed in the values.

There's an example which looks like

         att=est_evidence:1         Match features with an
                                    "est_evidence" property of "1"

Does this match "1" identically or does it also
match "100" and "91"?  If the latter aren't allowed then
suppose I do

         att=est_evidence:*1*

Are 100 and 91 now possible matches?
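
To make the question concrete, here is how one possible reading
behaves, namely that the glob must match the whole property value.
It uses Python's fnmatch as a stand-in for whatever glob rules the
spec ends up with, which is an assumption, not something the spec
says.

```python
from fnmatch import fnmatchcase

values = ["1", "100", "91", "2"]

# Pattern "1" has no wildcard, so only the identical string matches.
exact = [v for v in values if fnmatchcase(v, "1")]

# Pattern "*1*" matches any value containing a "1".
wild = [v for v in values if fnmatchcase(v, "*1*")]
```

Under this reading exact is ["1"] and wild is ["1", "100", "91"].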


Why don't we weaken the requirements for how a server
must respond to a given query?  Right now all attribute
fields must be searchable as strings by glob.

Some fields may only support searching by term,
perhaps with limited stemming.  Others may be
numeric with no glob support at all.

Some implementers may even want to support range
searches for a given numeric field, perhaps like:

   weight:1000..2500
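
A rough sketch of how a server might read such a term.  The ".."
separator is taken from the example above; treating both ends as
inclusive and the names parse_range and in_range are assumptions.

```python
def parse_range(term):
    """Split 'field:low..high' into (field, low, high) with numeric bounds."""
    field, _, spec = term.partition(":")
    low, sep, high = spec.partition("..")
    if not sep:
        raise ValueError("not a range query: %r" % term)
    return field, float(low), float(high)

def in_range(value, low, high):
    """Inclusive on both ends (an assumption, not from the spec)."""
    return low <= value <= high
```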

BTW, why are we using "att" for this prefix instead of
"prop"?  After all, aren't these the property names?


Speaking of which, given a property attribute name
like "weight" how does a client get to a description
of that property?  It looks like it's magic - it's
tacked on to the base property URL so

  base property URL: http://www/das/genome/volvox/1/property/
  property name: weight

means the description can be found at:

    http://www/das/genome/volvox/1/property/weight
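
In code, that "tack it on to the base" step would look something like
this, assuming ordinary relative-URL resolution is what's intended.
The function name description_url is made up.

```python
from urllib.parse import urljoin

def description_url(base_property_url, property_name):
    """Resolve a property name against the base property URL."""
    return urljoin(base_property_url, property_name)

url = description_url("http://www/das/genome/volvox/1/property/", "weight")
```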

Going the other way around, to get a list of potential
property fields, does a client start with the PROPERTIES
request at

   http://www/das/genome/volvox/1/property/

then get the fully resolved 'id' for each PROPERTY
(which may be joined with the base or may be a
full URI itself) and extract the last term in the id?

If so, that won't work with LSIDs or other opaque
identifiers.  Better would be to add a "name" attribute
to the PROPERTY element so the whole thing looks like

<PROPERTIES>
   <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
      name="weight" xs:type="xs:float"
      definition="molecular weight in daltons"
    />
   <PROPERTY id="urn:lsid:ibm.com:prop:pI"
      name="PI" xs:type="xs:float"
      definition="isoelectric point"
    />
</PROPERTIES>

and the 'name' field is what's used in the queries.
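
A client-side sketch of that lookup, assuming the proposed "name"
attribute exists.  The xmlns:xs declaration is added here only so a
standard XML parser accepts the "xs:" prefix; it and the function
name property_ids are assumptions.

```python
import xml.etree.ElementTree as ET

PROPERTIES_XML = """\
<PROPERTIES xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
      name="weight" xs:type="xs:float"
      definition="molecular weight in daltons"
    />
   <PROPERTY id="urn:lsid:ibm.com:prop:pI"
      name="PI" xs:type="xs:float"
      definition="isoelectric point"
    />
</PROPERTIES>
"""

def property_ids(xml_text):
    """Map each query name to its (possibly opaque) property id."""
    root = ET.fromstring(xml_text)
    return {prop.get("name"): prop.get("id") for prop in root.iter("PROPERTY")}

ids = property_ids(PROPERTIES_XML)
```

The client never has to pick apart the id, so an LSID works as well
as a URL.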

This could be expanded some more to flag which fields
are searchable and to point to documentation for how to
do the search.  Eg, the "can_query" attribute, if it exists
and is anything other than the empty string, means that
searches are allowed on the given property and the
"query_href" attribute points to documentation about
the format of that query.

<PROPERTIES>
   <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
      name="weight" xs:type="xs:float"
      definition="molecular weight in daltons"
      can_query="T" query_href="http://www/help/query#mw"
    />
   <PROPERTY id="urn:lsid:ibm.com:prop:pI"
      name="PI" xs:type="xs:float"
      definition="isoelectric point"
      can_query="T" query_href="http://www/help/query#pI"
    />
</PROPERTIES>
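
A sketch of the "can_query" rule as described: present and non-empty
means searchable.  The sample document is a cut-down variant of the
one above with a hypothetical "length" property and an xmlns:xs
declaration added; queryable is a made-up name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<PROPERTIES xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <PROPERTY id="http://www/das/genome/volvox/1/property/weight"
      name="weight" xs:type="xs:float"
      can_query="T" query_href="http://www/help/query#mw" />
   <PROPERTY id="urn:lsid:ibm.com:prop:pI"
      name="PI" xs:type="xs:float" can_query="" />
   <PROPERTY id="http://www/das/genome/volvox/1/property/length"
      name="length" xs:type="xs:float" />
</PROPERTIES>
"""

def queryable(xml_text):
    """Map searchable property names to their query documentation URL."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("PROPERTY"):
        # A missing attribute gives None and an empty one gives "";
        # both count as "not searchable".
        if prop.get("can_query"):
            found[prop.get("name")] = prop.get("query_href")
    return found

searchable = queryable(SAMPLE)
```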



					Andrew
					dalke at dalkescientific.com



