[MOBY-l] Queries in bioMOBY

Fri Nov 8 13:08:55 UTC 2002

>>>>> "Chris" == Chris Mungall <cjm at fruitfly.org> writes:

  Chris> I'm now of the opinion that domain-specific APIs are
  Chris> fundamentally limited (albeit convenient). Wrapping a
  Chris> declarative language with an imperative one seems like a step
  Chris> backwards.

  Chris> What about a distributed solution that presents heterogeneous
  Chris> databases as if they were a unified single database? There
  Chris> would be one universal API allowing querying through some
  Chris> declarative language. I would make a distinction between
  Chris> services (eg a sequence analysis service) and data sources
  Chris> (eg a sequence database). Services would be function calls
  Chris> within the declarative language. Proxy servers could easily
  Chris> import functions and data to avoid the overhead of
  Chris> distributed joins at the cost of data lag.

There are several different systems that I know of for doing this
sort of thing. Although you might not like it, as it uses a
domain-specific wrapper to draw things together, the TAMBIS project
was essentially meant to provide access to multiple heterogeneous
databases in this way.

There is also a lot of work being done on distributed query
processing in a grid context. For instance, from my own group, take
this paper. 

http://www.cs.man.ac.uk/grid-db/papers/dqp.pdf

which includes a bioinformatics query, written, in part, by yours
truly. It uses a variety of resources including a service (blast,
somewhat inevitably), and a couple of different data sources,
including some dodgy thing called the gene ontology.

It uses a declarative query language which looks like...

select p1.proteinId, p2.proteinId
from p1 in protein, t1 in proteinTerm, 
     p2 in Blast(p1.sequence), t2 in proteinTerm
where p1.proteinId=t1.proteinId and t1.termID="S0000092"

and so on. If memory serves, this is OQL. It's fairly standard either
way. In this case the system was distributed because GO was on one
machine (as an SQL database), blast was on another machine (as an
executable), and I think that there was an object database involved
somewhere as well. 
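
To make the shape of that query a bit more concrete, here is a rough
Python sketch of how the visible part of it could be evaluated as a
plain nested loop. Everything in it is an assumption for illustration:
the Protein and ProteinTerm records, the run_query function, and the
toy_blast stand-in are made-up names, not anything from the paper or
from the real system. The conditions on t2 are elided in the query
above, so the sketch leaves t2 out rather than guess at them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Protein:
    # Hypothetical record standing in for a row of the protein source.
    protein_id: str
    sequence: str


@dataclass(frozen=True)
class ProteinTerm:
    # Hypothetical record linking a protein to an ontology term.
    protein_id: str
    term_id: str


def run_query(
    proteins: Iterable[Protein],
    protein_terms: Iterable[ProteinTerm],
    blast: Callable[[str], Iterable[Protein]],
    term_id: str,
) -> Iterator[Tuple[str, str]]:
    """Evaluate only the conditions shown in the query fragment."""
    # where p1.proteinId = t1.proteinId and t1.termID = term_id
    annotated = {t.protein_id for t in protein_terms if t.term_id == term_id}
    for p1 in proteins:
        if p1.protein_id not in annotated:
            continue
        # p2 in Blast(p1.sequence): the service is just a function call.
        for p2 in blast(p1.sequence):
            yield (p1.protein_id, p2.protein_id)


# Toy in-memory data in place of the real, remote data sources.
PROTEINS = [
    Protein("P1", "MKVLAA"),
    Protein("P2", "MKVLGG"),
    Protein("P3", "TTTTTT"),
]
TERMS = [ProteinTerm("P1", "S0000092"), ProteinTerm("P3", "S0000001")]


def toy_blast(sequence: str) -> Iterable[Protein]:
    # Stand-in for a remote blast service: a "hit" is any protein
    # sharing the first four residues. Real blast does nothing so crude.
    return [p for p in PROTEINS if p.sequence[:4] == sequence[:4]]


if __name__ == "__main__":
    for row in run_query(PROTEINS, TERMS, toy_blast, "S0000092"):
        print(row)
```

In a real distributed setting each of the three arguments would sit on
a different machine, and the hard part is deciding where the filtering
and joining should happen, which this sketch ignores completely.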

It should be possible to layer this sort of functionality on top of
moby, or indeed any distributed middleware system, although,
naturally, if you want it to work at all efficiently, then you are
heading for a whole mess of trouble. We have people working on this
in the context of mygrid, because, hey, we like pain. 

My own feeling about all of this, though, is that essentially the
query language is just the interface that you use to access the
data. So, when I use GO, I generally use the OO interface, because
it's a more obvious way to model a DAG, at least to my mind, so I
found it easier to operate over. Even if you bolt on a query language,
you still have all the same problems that you would get if you wrote an
OO interface instead. That is: what to do if the network breaks, what
to do about semantic heterogeneity, and how to get the thing to work
faster and with less effort than it would take to download and install
all the data sources locally.
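
As an illustration of why an OO view of a DAG can feel natural, here
is a minimal Python sketch. The Term class, its ancestors method, and
the term identifiers are all hypothetical and are not the actual GO
API; the point is only that "everything above this term" is a short
graph walk when each term holds references to its parents.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Term:
    # Hypothetical ontology term; a term may have several parents,
    # which is what makes the structure a DAG rather than a tree.
    term_id: str
    name: str
    parents: List[Term] = field(default_factory=list)

    def ancestors(self) -> Set[str]:
        """Return the ids of every term reachable via parent links."""
        seen: Set[str] = set()
        stack = list(self.parents)
        while stack:
            term = stack.pop()
            if term.term_id not in seen:
                seen.add(term.term_id)
                stack.extend(term.parents)
        return seen


# Made-up identifiers, not real GO accessions.
root = Term("T:root", "root")
a = Term("T:a", "process a", [root])
b = Term("T:b", "process b", [root])
c = Term("T:c", "process c", [a, b])  # two parents

if __name__ == "__main__":
    print(sorted(c.ancestors()))
```

None of this makes the network or the semantics any less painful, of
course; it only changes how the data looks once you have it.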

Standard disclaimer...I'm just a poor biologist (from a poor family),
and not an expert on distributed query processing, so everything said
here could be complete rubbish. 

Phil