[MOBY-l] Queries in bioMOBY

Fri Nov 8 01:25:35 UTC 2002

I may be off the mark here, as I haven't read the MOBY docs in detail, but I
believe one issue MOBY doesn't seem to address is querying. This
shortcoming is inherent in the prevailing object-oriented architecture
paradigm, rather than in MOBY per se. Asking for objects by identifier only
gets you so far; most interesting bioinformatics involves querying.

GO seems to be a common use case - I wrote an OO API for the GO database.
It saves a lot of time in terms of the most common operations - fetch
graph by GO ID, search for terms, search for annotated products etc. But
eventually you want to start doing complex queries such as "find all
mammalian gene products that are involved in process 'transcription' but
are not transcription factors" or some such. This can either be done
imperatively by the API client (inefficient, slow, cumbersome), or the API
designer could provide the API client programmer with an ad hoc query
language (which is silly - you may as well use SQL).
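To make the contrast concrete, here is a minimal sketch of that kind of query done declaratively, using Python's built-in sqlite3 module. The two-table schema, the taxon labels and the gene product names (gpA, gpB, gpC) are made up for illustration; the real GO database is laid out differently.

```python
import sqlite3

# Toy schema and data, invented for this example only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, taxon TEXT);
CREATE TABLE annotation (gene_product_id INTEGER, term_name TEXT);
INSERT INTO gene_product VALUES
  (1, 'gpA', 'mammal'), (2, 'gpB', 'mammal'), (3, 'gpC', 'fly');
INSERT INTO annotation VALUES
  (1, 'transcription'), (1, 'transcription factor activity'),
  (2, 'transcription'),
  (3, 'transcription');
""")

# "Mammalian gene products involved in transcription that are not
# transcription factors", stated once and left to the query planner.
QUERY = """
SELECT gp.symbol
FROM gene_product AS gp
JOIN annotation AS a ON a.gene_product_id = gp.id
WHERE gp.taxon = 'mammal'
  AND a.term_name = 'transcription'
  AND NOT EXISTS (
    SELECT 1 FROM annotation AS b
    WHERE b.gene_product_id = gp.id
      AND b.term_name = 'transcription factor activity')
ORDER BY gp.symbol
"""

symbols = [row[0] for row in conn.execute(QUERY)]
print(symbols)
```

Doing the same thing through an object API would mean fetching every annotated product and filtering in client code, which is the inefficient route described above.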

I'm now of the opinion that domain-specific APIs are fundamentally limited
(albeit convenient). Wrapping a declarative language with an imperative
one seems like a step backwards.

What about a distributed solution that presents heterogeneous databases as
if they were a unified single database? There would be one universal API
allowing querying through some declarative language. I would make a
distinction between services (eg a sequence analysis service) and data
sources (eg a sequence database). Services would be function calls within
the declarative language. Proxy servers could easily import functions and
data to avoid the overhead of distributed joins at the cost of data lag.
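As a rough local illustration of "services as function calls", sqlite3 lets you register an ordinary Python function and call it from SQL. The reverse_complement function below is only a stand-in for a real analysis service, and the seq table is a made-up example; nothing here is distributed.

```python
import sqlite3

def reverse_complement(residues):
    """Stand-in for an analysis service (hypothetical, runs locally)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(residues))

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name revcomp (one argument).
conn.create_function("revcomp", 1, reverse_complement)

conn.execute("CREATE TABLE seq (display_id TEXT, residues TEXT)")
conn.execute("INSERT INTO seq VALUES ('s1', 'AACG')")

# The "service" is now just another expression in the SELECT list.
rows = conn.execute("SELECT display_id, revcomp(residues) FROM seq").fetchall()
print(rows)
```

In a distributed setting the function body would be a remote call rather than local code, but the query text would look the same.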

The framework for this could be xml schema + xml query language (like
DiGIR?), OWL/DAML+OIL plus associated query language, or a relational
model plus SQL. The latter seems the most sensible - robust open source
technology and (very importantly) a sound theoretical underpinning.

The key elements are:

- data exposure, via a universal data model; for example, relations or trees
(XML). Objects are ill-suited due to the lack of strong theoretical
underpinnings.

- querying, via a universal, expressive, declarative language

I vaguely remember a lot of talk about federated databases when I was
starting out in bioinformatics. Nothing much came of this. I put this down
to immature technology (no postgres and the other DBs cost $$$) and an
influx of OO programmers (myself included), which led to CORBA. Maybe it's
time for a revival? All this stuff goes in circles anyway....

Of course, one problem with a powerful declarative interface vs a weak
imperative one is it's easy to launch server-killing queries. (But of
course it would be possible to guard against this with some kind of
anti-server-hogging daemon).
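One simple form such a guard could take is a per-query time budget. The sketch below uses sqlite3's progress handler to abort a statement that runs too long; the helper name run_with_time_limit and the 0.2 second budget are arbitrary choices for the example, and a real server would more likely rely on the database's own statement timeout settings.

```python
import sqlite3
import time

def run_with_time_limit(conn, sql, seconds):
    """Run sql and return its rows, or None if it exceeds the time budget."""
    deadline = time.monotonic() + seconds
    timed_out = False

    def check_deadline():
        # Called periodically by SQLite; a nonzero return aborts the query.
        nonlocal timed_out
        if time.monotonic() > deadline:
            timed_out = True
            return 1
        return 0

    conn.set_progress_handler(check_deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        if timed_out:
            return None
        raise  # a genuine SQL error, not our timeout
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler

# A deliberately endless query: a recursive counter with no stop condition.
RUNAWAY = """
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n)
SELECT count(*) FROM n
"""

conn = sqlite3.connect(":memory:")
print(run_with_time_limit(conn, "SELECT 1", 1.0))
print(run_with_time_limit(conn, RUNAWAY, 0.2))
```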

As an example, the following query would find the swissprot sequence of
all fly DNA Binding (including subtypes of DNA Binding) proteins and then
blast them against nr:

SELECT
 sptr:seq.display_id,
 sptr:seq.description,
 local:myfilter(ncbi:blastall('blastp',
                              '-filter SEG+XNU',
                              'nr',
                              sptr:seq.residues))
FROM
  sptr:seq NATURAL JOIN go:term NATURAL JOIN go:closure
  NATURAL JOIN flybase:go_association
  NATURAL JOIN flybase:gene_product
WHERE
  go:term.name = 'DNA Binding'
 AND
  flybase:go_association.is_curated = TRUE
 AND
  flybase:gene_product.species = 'D melanogaster'
WITH NAMESPACES
 'http://www.ebi.ac.uk/sptr/moby' AS sptr,
 'http://www.flybase.org/moby' AS flybase,
 'http://www.geneontology.org/moby' AS go
;

OK, it's a bit wordy, but it is a complex query - I personally would
rather write a query like this than let an API make all the
important decisions (closure, evidence, species etc) for me. Plus, it's
easy to see how you would extend this - say to get only fly proteins that
are expressed at certain places/times.

(It's a somewhat disingenuous example, as all the data is currently
available in one tablespace in the GO warehouse database anyway, or in
flybase)

(There is also a slight cheat involving the transitive closure table in
the above example - this couldn't be done purely with natural joins)
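For what it's worth, a closure table like this can be derived from plain parent/child links with a recursive query. The sketch below does it in sqlite3; the term_parent table and its three rows are a toy hierarchy for illustration, not the actual GO schema or data, and the closure here is non-reflexive (a term is not listed as its own ancestor).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy hierarchy: each row says child is a direct subtype of parent.
CREATE TABLE term_parent (child TEXT, parent TEXT);
INSERT INTO term_parent VALUES
  ('sequence-specific DNA binding', 'DNA binding'),
  ('DNA binding', 'nucleic acid binding'),
  ('nucleic acid binding', 'binding');
""")

# Start from the direct links, then keep walking upwards until
# no new (descendant, ancestor) pairs appear.
CLOSURE = """
WITH RECURSIVE closure(descendant, ancestor) AS (
  SELECT child, parent FROM term_parent
  UNION
  SELECT c.descendant, tp.parent
  FROM closure AS c
  JOIN term_parent AS tp ON tp.child = c.ancestor
)
SELECT descendant FROM closure
WHERE ancestor = ?
ORDER BY descendant
"""

def subtypes(term_name):
    """All direct and indirect subtypes of term_name in the toy hierarchy."""
    return [row[0] for row in conn.execute(CLOSURE, (term_name,))]

print(subtypes("nucleic acid binding"))
```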

If the above query is too slow, I can easily build my own local warehouse
copy of the table I need like this:

CREATE TABLE local:seq AS SELECT * FROM sptr:seq;
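A small runnable analogue of this, assuming sqlite3 and using an attached in-memory database as a stand-in for the remote sptr source (the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Pretend the attached database is the remote source.
conn.execute("ATTACH DATABASE ':memory:' AS sptr")
conn.execute("CREATE TABLE sptr.seq (display_id TEXT, residues TEXT)")
conn.executemany(
    "INSERT INTO sptr.seq VALUES (?, ?)",
    [("s1", "MKV"), ("s2", "MLA")],
)

# Materialise a local copy in the main database.
conn.execute("CREATE TABLE main.seq AS SELECT * FROM sptr.seq")
conn.commit()  # DETACH is not allowed inside an open transaction

# Drop the "remote" source; the local copy still answers queries.
conn.execute("DETACH DATABASE sptr")
local_count = conn.execute("SELECT count(*) FROM seq").fetchone()[0]
print(local_count)
```

The cost is the data lag mentioned earlier: the local copy is only as fresh as the last time it was rebuilt.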

And I can also just compile the 'blastall' function into my local copy of
postgres.

And if I really want to have a simple API wrapper that cans common queries
I can still do it - the difference is the API isn't exposed across the
wire, only the SQL+relations (or xml-queries + xml, or S-expressions +
lisp functions) are.

I know this all sounds very retro and doesn't take into account all the
latest SOAP + EJB stack type technology, but we don't have to follow the
market-led software engineering herd into every dreadful
committee-designed anacronymistic technology.

The hard part is optimising distributed joins, but I imagine this has been
solved over and over in various CS projects; it's just a question of
waiting until this makes its way into existing robust DBs like postgres.
(postgres already has nice extensible functions, so the above example is,
I think, doable on a purely local installation.)

Am I way out on a limb here, or is there any room for this sort of thing in
the bioMOBY world?

--
chris