[DAS2] properties and queries

Andrew Dalke dalke at dalkescientific.com
Tue Feb 7 12:19:28 UTC 2006


We've had a long discussion here about properties and how to
search them.  As it stands now the spec has a few holes in it.

Here are the properties we've talked about.

program_name: the program used to make the annotation, like
   "BLASTX 1.2.3"

notes:
   There can be 0 or more notes.  Notes might refer to other
   notes (eg, "the previous note said XYZ but I think ABC")

phase: (is it 0, 1, 2 or 1, 2, 3?)
   (And does anyone use this? People here don't use it; Thomas
    "reinfers it by counting along the transcript" "but maybe
    that's just me".  Others say they don't use the DAS1 phase.)

icon: a hypothetical image used for the feature, perhaps as
    a binary png;

curation history:
   a list of elements, each with
    - person
    - timestamp
    - reason for change

score: a floating point number, which may be in exponential
    notation like "1E-3"

Each one needs different search mechanisms.  For example,
   "annotations done by that buggy version of BLAST 1.2.3"
   "scores better than 1E-2"
   "changes by Andrew done in August 2004"
   "notes with the substring 'helicase'" (case sensitive or not?)
   "notes with the phrase 'E. Coli'" (substring might not work
       if the note has 'E.\nColi')
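
To make the 'E.\nColi' problem concrete, here is a rough Python sketch of
a phrase match that collapses whitespace before comparing.  The function
names and the case-insensitive default are my own assumptions; nothing
in the spec defines them.

```python
import re

def normalize(text, case_sensitive=False):
    """Collapse every run of whitespace (including newlines) to one space."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed if case_sensitive else collapsed.lower()

def note_matches(note, phrase, case_sensitive=False):
    """True if phrase occurs in note after whitespace/case normalization."""
    return normalize(phrase, case_sensitive) in normalize(note, case_sensitive)

note = "Similar to a helicase from E.\nColi"
print("E. Coli" in note)              # plain substring test: False
print(note_matches(note, "E. Coli"))  # normalized phrase test: True
```

Whether the default should be case sensitive is exactly the open question
above; the sketch only shows that the choice has to be made somewhere.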

The property storage scheme doesn't handle this quite correctly.
Here are problems:

   - how do you store multiple notes?

Answer 1: use structured names, like "note_1", "note_2", "note_3", ...
HACK! Then what if a note is deleted?  Bigger problem; how do you
search the "note" field using the existing query language?

Answer 2: allow duplicate note elements, like
   <prop key="note" value="This is a note" />
   <prop key="note" value="The previous note is a lie!" />
   <prop key="note" value="Ignore the 2nd note - silly Cretan!" />

Question: so the order must be preserved if two fields have the
same name?  Can't implement with a dictionary/hash data type.

Question: what if there are duplicate "score" or "phase" elements?
Which one wins?
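
The ordering and "which one wins" questions show up as soon as you try
to load duplicate <prop> elements.  A small Python sketch, assuming a
made-up <feature> wrapper element just so the snippet parses:

```python
import xml.etree.ElementTree as ET

# The <feature> wrapper is hypothetical; only the <prop> form is from above.
xml_text = """
<feature>
  <prop key="note" value="This is a note" />
  <prop key="note" value="The previous note is a lie!" />
  <prop key="score" value="1E-3" />
</feature>
"""

root = ET.fromstring(xml_text)

# A list of (key, value) pairs keeps duplicates and document order.
props = [(p.get("key"), p.get("value")) for p in root.iter("prop")]

# A plain dict keeps one value per key: the last note silently wins.
as_dict = dict(props)

# A dict of lists keeps order within a key, but not across keys.
grouped = {}
for key, value in props:
    grouped.setdefault(key, []).append(value)

print(props)
print(as_dict["note"])
print(grouped["note"])
```

So a server or client that wants Answer 2 needs the list form (or the
dict-of-lists form), not a simple key/value hash.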

Answer 3: Notes are important and we know we need them now.
Let's have a <NOTE> element and not make it be a property.

<NOTE>This is a note</NOTE>
<NOTE>The previous note is a lie!</NOTE>
<NOTE>Is this an E or a NOT-E?</NOTE>

(perhaps also with timestamp and author name, but that's a different
question.)  Then we also define that the "note=" parameter in a
DAS query is a substring search of the <NOTE> elements of a feature.

I like this one.
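
For illustration, a server-side "note=" filter over <NOTE> elements
could look roughly like this in Python.  The <FEATURES>/<FEATURE>
wrappers and the "id" attribute are assumptions made for the sketch, and
the match is a plain case-sensitive substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout; only the <NOTE> element is from the proposal.
DOC = """
<FEATURES>
  <FEATURE id="f1">
    <NOTE>This is a note</NOTE>
    <NOTE>The previous note is a lie!</NOTE>
  </FEATURE>
  <FEATURE id="f2">
    <NOTE>Putative helicase</NOTE>
  </FEATURE>
</FEATURES>
"""

def features_with_note(xml_text, substring):
    """Return ids of features having at least one NOTE containing substring."""
    root = ET.fromstring(xml_text)
    hits = []
    for feature in root.iter("FEATURE"):
        notes = [n.text or "" for n in feature.findall("NOTE")]
        if any(substring in note for note in notes):
            hits.append(feature.get("id"))
    return hits

print(features_with_note(DOC, "helicase"))
```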


   - How do you do numeric searches?

This is hypothetical.  There hasn't been a requirement for this.
'Course it may be because people haven't had the ability.  In
any case, how to search numeric fields like "score" with comparisons?
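
As a sketch only: if scores are stored as text, a comparison search
needs to parse both sides as floats first.  The operator names ("lt",
"gt", ...) below are invented for the example and are not part of any
DAS query syntax.

```python
import operator

# Hypothetical operator names; the spec does not define any of these.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "eq": operator.eq,
}

def score_matches(score_text, op_name, threshold_text):
    """Compare a stored score string against a threshold string numerically."""
    try:
        score = float(score_text)          # float() accepts "1E-3"
        threshold = float(threshold_text)
    except ValueError:
        return False                       # non-numeric values never match
    return COMPARATORS[op_name](score, threshold)

# "better than 1E-2", assuming smaller is better (as for an E-value)
print(score_matches("1E-3", "lt", "1E-2"))
```

Note that a string comparison would get this wrong, since "1E-3" sorts
after "1E-2" as text.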


  - querying non-queryable fields

If there's embedded binary data, like an image, is it searchable?
Does a server complain and die? Ignore the request?

  - more complex text searches

"proteinase but not inhibitor"

  - complex data

We have support for non-DAS extensions, which might be

<sanger:curation-history xmlns:sanger="http://www.sanger.ac.uk/das/ext">
  <sanger:curation name="Andrew" date="2005-06-07">
    Change this into that because of some reason or other
  </sanger:curation>
</sanger:curation-history>


Thomas proposed that we support some sort of complex query
language, probably in XML, and get rid of the simple query scheme
we have now.

I argued against the complexity of that given that nearly all
of the queries will be "give me these feature types on this range
of that chromosome".  I also pointed out that developing a
generic query language is hard, and implementing it is harder.
Why require all that effort?

Roy commented the other way - in a server with only a few hundred
features, why require a query language at all?  Just return all
of the features in the request.

Here's what I proposed.

We have the "CATEGORY" (but after discussion I now want to take
it back to "CAPABILITY" since that's now much closer to what
it does - it describes where to go to do something)

So I'll use "CAPABILITY"

The current scheme has

<CAPABILITY type="features" query_url="http://...../features">
   <FORMAT ... />
</CAPABILITY>

This is an extensibility point.  Suppose Thomas supports an XML
query search interface on his server, with Sanger
clients that handle it.  Then there can be

<CAPABILITY type="thomas-xml-search" query_url="http.../search-features">
   <FORMAT ... />
</CAPABILITY>

A client can see the list of CAPABILITIES and decide to
use the feature search mechanism it likes best.
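
A client-side sketch of that choice in Python.  The <SOURCE> wrapper and
the example.org URLs are placeholders I made up for the snippet; they
are not the real (elided) URLs above.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element and URLs, for illustration only.
SOURCE_DOC = """
<SOURCE>
  <CAPABILITY type="features"
              query_url="http://example.org/das/features" />
  <CAPABILITY type="thomas-xml-search"
              query_url="http://example.org/das/search-features" />
</SOURCE>
"""

def pick_capability(xml_text, preferred_types):
    """Return (type, query_url) for the first preferred type the server lists."""
    root = ET.fromstring(xml_text)
    by_type = {c.get("type"): c.get("query_url")
               for c in root.iter("CAPABILITY")}
    for cap_type in preferred_types:
        if cap_type in by_type:
            return cap_type, by_type[cap_type]
    return None  # the server offers nothing this client understands

# A Sanger-style client prefers the XML search, falling back to plain DAS.
print(pick_capability(SOURCE_DOC, ["thomas-xml-search", "features"]))
```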

In addition, we could say that "this supports the normal DAS
query scheme but also supports extension vocabulary".  For example,

<CAPABILITY type="features" query_url="http://...../features">
   <SUPPORTS name="sanger-curation" />
   <FORMAT ... />
</CAPABILITY>

With this a client knows that the query_url supports the normal
DAS queries and also supports the "annotator", "annotation_before"
and "annotation_after" queries, like this

   .../features?annotator=Andrew;annotation_before=2005

Possible idea: if there is no SUPPORTS tag then the server
implements no search syntax and instead returns everything,
for the example Roy mentioned.
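
Here is how a client might act on SUPPORTS, as a Python sketch.  The
mapping from "sanger-curation" to its three parameter names comes from
the example above; the function name, the example.org URL, and the
choice to raise on an unadvertised parameter are my assumptions.  The
sketch handles only extension parameters, not the standard DAS ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Extension vocabulary name -> query parameters it adds (from the example).
EXTENSION_PARAMS = {
    "sanger-curation": {"annotator", "annotation_before", "annotation_after"},
}

def build_extension_query(capability_xml, params):
    """Build a query URL using only extension parameters the server advertises."""
    cap = ET.fromstring(capability_xml)
    url = cap.get("query_url")
    supported = {s.get("name") for s in cap.findall("SUPPORTS")}
    if not supported:
        # "Possible idea": no SUPPORTS tag means no search syntax at all,
        # so ask for everything.
        return url
    allowed = set()
    for name in supported:
        allowed |= EXTENSION_PARAMS.get(name, set())
    unknown = set(params) - allowed
    if unknown:
        raise ValueError("server does not advertise: %s" % sorted(unknown))
    # DAS-style queries separate fields with ';' as in the example URL.
    query = ";".join("%s=%s" % (quote(k), quote(str(v)))
                     for k, v in params.items())
    return url + "?" + query if query else url

CAP = """
<CAPABILITY type="features" query_url="http://example.org/das/features">
  <SUPPORTS name="sanger-curation" />
</CAPABILITY>
"""
print(build_extension_query(CAP, {"annotator": "Andrew",
                                  "annotation_before": "2005"}))
```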

Okay, we're off to lunch.

					Andrew
					dalke at dalkescientific.com



