[DAS2] feature filter examples
Andrew Dalke
dalke at dalkescientific.com
Fri Apr 8 06:10:57 UTC 2005
I just realized I forgot today's conference call!
Ed:
> Was that supposed to be a URL-encoded space? If so, is that your
> recommended way of encoding it (single-encoded vs. double-encoded) ?
I didn't answer that one. It doesn't make a difference. The
unescaping of both '+' and '%xx' is done by the same
piece of code.
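
A minimal sketch of that point, assuming Python's standard
urllib.parse as the decoder (the actual server code is not
shown in this thread and may differ):

```python
from urllib.parse import unquote_plus

# Two spellings of the same query text "5' end":
# one encodes the space as '+', the other as '%20'.
plus_form = unquote_plus("5%27+end")
percent_form = unquote_plus("5%27%20end")

print(plus_form)      # 5' end
print(percent_form)   # 5' end
```

Both forms pass through the same call and come out identical,
so the choice of encoding is invisible to the query parser.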
I've been thinking more about simple query languages.
The one I proposed is Google-like. I also remembered
that Lucene has a Google-like query language,
http://lucene.apache.org/java/docs/queryparsersyntax.html
except it's rather more complicated because the
query engine is more powerful. And unlike Google,
multiple terms with no explicit operator between them
are treated as an OR search instead of an AND search.
(So "this that" is the same as "this OR that".)
Then there's MySQL's boolean search language at
http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html
which has a different set of operations but doesn't
allow qualified searches like name:Romana .
All are different, but at least have a common core of
qualifier:search_text
Compare to Entrez which looks like
search_text[qualifier]
In taking a closer look at the corner cases I noticed
that it's awfully hard to handle non-alpha characters.
Consider a search for 5'. In Google that gets turned
into a search for 5. Searching for 5' end (typed without
double quotes) returns a few irrelevant hits in the top
10, such as
NetScreen-5 End of Life (EOL) Announcement
How well do we want to support special characters?
DAS/1 includes examples like "5'UTR"
Is case sensitivity important?
How much should be required by the spec and how
much can be left up to the implementor?
Realistically speaking I don't think we can
make that many restrictions on what the server
can do.
As I mentioned, I'm not even sure if we can
always support the glob-style searches specified
by the current spec. (E.g., with numeric fields.
Also, how can it be implemented in MySQL?)
I'm fine with a simple query language like this:

  query ::= term*
  term ::=
      "OR"
      word
      quoted_phrase
      qualifier ":" word
      qualifier ":" quoted_phrase
  word ::= /[A-Za-z0-9_-]+/
  quoted_phrase ::= /"[^"]*"/
  qualifier ::= /[a-zA-Z][a-zA-Z0-9_-]*/
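
To make the grammar concrete, here is a rough Python sketch of
a parser for it. The names TERM_RE and parse_query and the
tuple layout are made up for illustration; nothing here is
part of the DAS/2 spec. Whitespace between terms is assumed,
since the grammar itself does not say how terms are separated.

```python
import re

# One term: an optional "qualifier:" followed by a quoted phrase or a word.
# The character classes are taken directly from the grammar above.
TERM_RE = re.compile(r'''
    (?:(?P<qualifier>[A-Za-z][A-Za-z0-9_-]*):)?
    (?:"(?P<phrase>[^"]*)"|(?P<word>[A-Za-z0-9_-]+))
''', re.VERBOSE)


def parse_query(query):
    """Return a list of (kind, qualifier, text) tuples.

    kind is "OR", "word" or "phrase"; qualifier is None when absent.
    Raises ValueError on any character the grammar does not allow.
    """
    terms = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1          # skip the (assumed) whitespace separators
            continue
        m = TERM_RE.match(query, pos)
        if m is None:
            raise ValueError(
                f"cannot parse query at position {pos}: {query[pos:]!r}")
        qualifier, phrase, word = m.group("qualifier", "phrase", "word")
        if qualifier is None and word == "OR":
            terms.append(("OR", None, None))
        elif phrase is not None:
            terms.append(("phrase", qualifier, phrase))
        else:
            terms.append(("word", qualifier, word))
        pos = m.end()
    return terms


print(parse_query('name:Romana OR "5 prime"'))
```

Note that an unquoted 5'UTR is rejected by this sketch,
because the apostrophe is not in the word character class.
Under this grammar such a search has to be written as a
quoted phrase.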
If more selectivity is needed then

  query ::= modified_term*
  modified_term ::=
      prefix term suffix
  prefix ::= /[+-]?/   # +gene means "gene" must be found
                       # -gene means "gene" must not be present
  suffix ::= /[*]?/    # gene* matches "gene" and "genetic"
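
A rough, self-contained Python sketch of how the prefix and
suffix could be recognized. ModifiedTerm and
parse_modified_terms are hypothetical names. Two assumptions
are baked in: terms are separated by whitespace, and a leading
'+' or '-' is always read as the prefix, even though '-' is
also a legal word character. "OR" is not special-cased here,
to keep the example short.

```python
import re
from typing import NamedTuple, Optional


class ModifiedTerm(NamedTuple):
    qualifier: Optional[str]   # text before ":", or None
    text: str                  # the word, or the phrase without its quotes
    is_phrase: bool
    required: bool             # "+" prefix: term must be found
    excluded: bool             # "-" prefix: term must not be present
    prefix_match: bool         # "*" suffix: match words starting with text


MODIFIED_TERM_RE = re.compile(r'''
    (?P<prefix>[+-]?)
    (?:(?P<qualifier>[A-Za-z][A-Za-z0-9_-]*):)?
    (?:"(?P<phrase>[^"]*)"|(?P<word>[A-Za-z0-9_-]+))
    (?P<suffix>[*]?)
''', re.VERBOSE)


def parse_modified_terms(query):
    """Split a query into ModifiedTerm records, or raise ValueError."""
    terms = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        m = MODIFIED_TERM_RE.match(query, pos)
        if m is None:
            raise ValueError(
                f"cannot parse query at position {pos}: {query[pos:]!r}")
        phrase = m.group("phrase")
        is_phrase = phrase is not None
        terms.append(ModifiedTerm(
            qualifier=m.group("qualifier"),
            text=phrase if is_phrase else m.group("word"),
            is_phrase=is_phrase,
            required=m.group("prefix") == "+",
            excluded=m.group("prefix") == "-",
            prefix_match=m.group("suffix") == "*",
        ))
        pos = m.end()
    return terms


for term in parse_modified_terms('+gene -intron gene*'):
    print(term)
```

Each record carries enough information for a backend to map
the term onto whatever required/excluded/truncation syntax its
own search engine offers.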
This appears to be supportable under both MySQL and
Lucene.
Perhaps we can have another parameter to the
search to specify the query language ("ql") used.
If not specified the default is the DAS/2 query
language. This would let server implementers tweak
the interface as need be.
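
As a sketch of how a server might honor such a parameter:
only the parameter name "ql" comes from the proposal above.
The "q" parameter, the identifiers "das2" and "lucene", and
the function name are placeholders invented for this example.

```python
from urllib.parse import parse_qs

DEFAULT_QL = "das2"   # hypothetical identifier for the default language


def select_query_language(query_string, supported=(DEFAULT_QL, "lucene")):
    """Return (ql, query_text) from a URL query string.

    Falls back to DEFAULT_QL when no "ql" parameter is given and
    raises ValueError when the requested language is not supported.
    """
    params = parse_qs(query_string)
    ql = params.get("ql", [DEFAULT_QL])[0]
    if ql not in supported:
        raise ValueError(f"unsupported query language: {ql!r}")
    text = params.get("q", [""])[0]
    return ql, text


print(select_query_language("q=name%3ARomana"))
print(select_query_language("q=this+that&ql=lucene"))
```

A server that only implements the default language would pass
a one-element supported tuple and reject everything else.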
BTW, looking over the DAS/1 spec, the reason this
is an issue now is that DAS/2 allows searches over
arbitrary fields and searches for something other
than an identifier.
Andrew
dalke at dalkescientific.com