[DAS2] feature filter examples

Andrew Dalke dalke at dalkescientific.com
Fri Apr 8 06:10:57 UTC 2005


I just realized I forgot today's conference call!


Ed:
> Was that supposed to be a URL-encoded space?  If so, is that your 
> recommended way of encoding it (single-encoded vs. double-encoded) ?

I didn't answer that one.  It doesn't make a difference.  The
unescaping for both '+' and '%xx' is done by the same
piece of code.
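
As an illustration only (the server code is not shown here), Python's
standard urllib.parse behaves the same way: one routine decodes both
'+' and '%xx'.  The "name" key and its value are made up for this
sketch.

```python
from urllib.parse import parse_qs

# The same value, with the space written once as '+' and once as '%20'.
# %27 is the apostrophe. The "name" key is a hypothetical parameter.
plus_form = parse_qs("name=5%27+UTR")
percent_form = parse_qs("name=5%27%20UTR")

# Both spellings decode to the same text.
print(plus_form)
print(percent_form)
```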

I've been thinking more about simple query languages.
The one I proposed is Google-like.  I also remembered
that Lucene has a Google-like query language.

http://lucene.apache.org/java/docs/queryparsersyntax.html

except it's rather more complicated because the
query engine is more powerful.  And unlike Google,
multiple terms with no explicit operator are treated
as an OR search instead of an AND search. (So
"this that" is the same as "this OR that".)

Then there's MySQL's boolean search language at
   http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html
which has a different set of operations but doesn't
allow qualified searches like name:Romana .

All are different, but at least have a common core of

   qualifier:search_text

Compare to Entrez which looks like
   search_text[qualifier]


In taking a closer look at the corner cases I noticed
that it's awfully hard to handle non-alpha characters.
Consider a search for 5'.  In Google that gets turned
into a search for 5.  "5' end" (without double quotes
in the search) has a few non-relevant hits in the top
10, like for
   NetScreen-5 End of Life (EOL) Announcement

How well do we want to support special characters?
DAS/1 includes examples like "5'UTR".

Is case sensitivity important?

How much should be required by the spec and how
much can be left up to the implementor?

Realistically speaking I don't think we can
make that many restrictions on what the server
can do.

As I mentioned, I'm not even sure if we can
always support the glob-style searches specified
by the current spec.  (E.g., with numeric fields.
Also, how can it be implemented in MySQL?)

I'm fine with a simple query language like this:

query ::= term*

term ::=
      "OR"
    | word
    | quoted_phrase
    | qualifier ":" word
    | qualifier ":" quoted_phrase

word ::= /[A-Za-z0-9_-]+/
quoted_phrase ::= /"[^"]*"/
qualifier ::= /[a-zA-Z][a-zA-Z0-9_-]*/
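
To show that this grammar is easy to implement, here is a minimal
sketch of a parser in Python.  It assumes the qualifier pattern is
meant to start with [a-zA-Z], that terms may be separated by
whitespace, and that a bare OR is an operator.  The name parse_query
and the tuple layout are made up for the example.

```python
import re

# One term: an optional "qualifier:" followed by a quoted phrase or a word.
TERM_RE = re.compile(r'''
    (?:(?P<qualifier>[A-Za-z][A-Za-z0-9_-]*):)?
    (?:"(?P<phrase>[^"]*)"|(?P<word>[A-Za-z0-9_-]+))
''', re.VERBOSE)

def parse_query(query):
    """Return a list of (kind, qualifier, text) tuples.

    kind is "word", "phrase", or "or"; qualifier is None when absent.
    """
    terms = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1          # skip whitespace between terms
            continue
        m = TERM_RE.match(query, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {query[pos]!r} at offset {pos}")
        qualifier = m.group("qualifier")
        if m.group("phrase") is not None:
            terms.append(("phrase", qualifier, m.group("phrase")))
        elif qualifier is None and m.group("word") == "OR":
            terms.append(("or", None, "OR"))
        else:
            terms.append(("word", qualifier, m.group("word")))
        pos = m.end()
    return terms

print(parse_query('name:Romana OR "5 prime" type:"UTR region"'))
```

Note that a search for 5'UTR is rejected by this sketch unless it is
quoted, because the apostrophe is not a word character.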

If more selectivity is needed then

query ::= modified_term*

modified_term ::=
    prefix term suffix

prefix ::= /[+-]?/  # +gene means "gene" must be found
                     # -gene means "gene" must not be present
suffix ::= /[*]?/   # gene* matches both "gene" and "genetic"
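
A minimal Python sketch of parsing the prefix and suffix modifiers
follows.  The names ModifiedTerm and parse_modified_terms are made up
for the example, and OR is left as an ordinary word here.  One
assumption is worth flagging: since '-' is also a legal word
character, "-gene" is ambiguous in the grammar as written, and this
sketch always reads a leading '-' or '+' as the prefix.

```python
import re
from typing import NamedTuple, Optional

class ModifiedTerm(NamedTuple):
    prefix: str                # "+", "-", or "" when absent
    qualifier: Optional[str]   # None when no "qualifier:" was given
    text: str                  # the word, or the phrase without quotes
    wildcard: bool             # True when a "*" suffix was present

MODIFIED_TERM_RE = re.compile(r'''
    (?P<prefix>[+-]?)
    (?:(?P<qualifier>[A-Za-z][A-Za-z0-9_-]*):)?
    (?:"(?P<phrase>[^"]*)"|(?P<word>[A-Za-z0-9_-]+))
    (?P<suffix>[*]?)
''', re.VERBOSE)

def parse_modified_terms(query):
    """Split a query into ModifiedTerm tuples, skipping whitespace."""
    terms = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        m = MODIFIED_TERM_RE.match(query, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {query[pos]!r} at offset {pos}")
        phrase = m.group("phrase")
        text = phrase if phrase is not None else m.group("word")
        terms.append(ModifiedTerm(m.group("prefix"), m.group("qualifier"),
                                  text, m.group("suffix") == "*"))
        pos = m.end()
    return terms

for term in parse_modified_terms('+gene -pseudo name:rom*'):
    print(term)
```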


This appears to be supportable under both MySQL and
Lucene.


Perhaps we can have another parameter to the
search to specify the query language ("ql") used.
If not specified the default is the DAS/2 query
language.  This would let server implementers tweak
the interface as need be.
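
As a rough sketch of how a server might dispatch on such a parameter
(in Python): only the name "ql" comes from the suggestion above.  The
"q" parameter, the "das2" and "lucene" identifiers, and both handler
functions are hypothetical placeholders.

```python
from urllib.parse import parse_qs

DEFAULT_QL = "das2"   # hypothetical identifier for the default language

def parse_das2(text):
    # Stand-in for a real DAS/2 query parser: split on whitespace.
    return text.split()

def passthrough(text):
    # Stand-in for handing the text unchanged to a native engine.
    return text

# Hypothetical registry of the query languages this server accepts.
QUERY_LANGUAGES = {"das2": parse_das2, "lucene": passthrough}

def handle_search(query_string):
    """Pick the query language from "ql", defaulting to DAS/2."""
    params = parse_qs(query_string)
    ql = params.get("ql", [DEFAULT_QL])[0]
    if ql not in QUERY_LANGUAGES:
        raise ValueError(f"unsupported query language: {ql!r}")
    text = params.get("q", [""])[0]
    return ql, QUERY_LANGUAGES[ql](text)

print(handle_search("q=name:Romana+gene"))
print(handle_search("ql=lucene&q=this+OR+that"))
```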

BTW, in looking over the DAS/1 spec the reason this
is an issue now is that we allow searches over
arbitrary fields and searches for something other
than an identifier.

					Andrew
					dalke at dalkescientific.com



