[MOBY-l] Queries in bioMOBY
Mark Wilkinson
mwilkinson at gene.pbi.nrc.ca
Fri Nov 8 15:31:56 UTC 2002
Hi Chris,
I chuckled as I read your message, as I wrote an almost identical SQL
query a couple of months ago in a discussion of my vision of MOBY with
some of the Canadian Bioinformatics Integration Network participants :-)
Yes, I agree that this query language would be an extremely powerful
tool, but I don't think that your vision and my vision are at odds
with each other - in fact, in some ways they complement each other.
Queries (at the level you are describing) are not something that I had
ever envisioned would be explicitly a part of MOBY itself, but rather
one of the tools that might be built on top of it. E.g. you explicitly
say "FROM flybase.go_accession" whereas in the MOBY world the location(s)
of go_accessions would be discovered for you... so the query would
merely say "FROM go_accession" or some such thing.
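A minimal sketch of that discovery step, assuming a made-up in-memory
registry (the provider names and the `discover` / `resolve_from` helpers
are hypothetical and are not part of any MOBY API): the query names only
the object type, and the registry supplies the locations.

```python
# Hypothetical registry: object type -> providers that can serve it.
# None of these names come from the MOBY API; they only illustrate discovery.
REGISTRY = {
    "go_accession": ["flybase", "sptr"],
    "seq": ["sptr"],
}


def discover(object_type, registry=REGISTRY):
    """Return the providers known to serve object_type (empty list if none)."""
    return list(registry.get(object_type, []))


def resolve_from(object_type, registry=REGISTRY):
    """Expand an unqualified FROM target into provider-qualified names."""
    return [f"{provider}.{object_type}"
            for provider in discover(object_type, registry)]


print(resolve_from("go_accession"))
```

The point is only that "FROM go_accession" can be expanded by a lookup,
so the person writing the query never has to name flybase.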
So... yeah... I think we are talking about different "layers", where
your query language would be sitting on top of MOBY. In addition, there
is now no doubt at all in my mind that we want (need?) a metadata layer
on top of MOBY (to more fully describe the transformations occurring in
the various services), but not necessarily a part of MOBY itself, and I
see your query-engine sitting perhaps even on top of that metadata layer
to make the semantic joins that I think you are describing in your example.
We probably do lose *some* power in our object-oriented approach, but I
doubt that we lose as much as you are suggesting since our objects are
supposed to be so lightweight that they represent only one or two (as
few as possible) database fields per object, and these are preferably
just representing database "keys". In this sense, your query language
should be quite happy on top of MOBY, since your query is basically
joining tables on their (semantic) foreign keys... Tackling the problem
in this way also seems to solve one of your concerns - that "killer
queries" probably don't have their killer computation done on the
database server, but rather in the machine running the query engine,
since the joins are not done on the server per se... though admittedly,
I haven't thought through this problem as well as you have so that might
be rubbish-talk.
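To make the "join in the query engine" idea concrete, here is a rough
sketch under stated assumptions: two services have each returned a list
of lightweight key-only objects, and the engine joins them in its own
memory with a plain hash join. The records, identifiers, and the
`hash_join` helper are invented for illustration; nothing here is a MOBY
call.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of dicts on a shared key, in the query engine's memory."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        # .get() avoids creating empty entries for keys with no match.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Stand-ins for lightweight objects returned by two separate services.
# The identifiers are illustrative placeholders, not real records.
associations = [
    {"go_id": "GO:0003677", "product_id": "FBgn0000001"},
    {"go_id": "GO:0003677", "product_id": "FBgn0000002"},
    {"go_id": "GO:0005515", "product_id": "FBgn0000003"},
]
products = [
    {"product_id": "FBgn0000001", "species": "D melanogaster"},
    {"product_id": "FBgn0000003", "species": "D melanogaster"},
]

rows = hash_join(associations, products, "product_id")
for row in rows:
    print(row)
```

Each server only answers simple key lookups; the expensive part (building
the index and matching rows) runs on the machine issuing the query.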
It might be an idea, as a MOBY use case, to take a query like Chris has
just provided and see if we can generate a workflow that solves it.
This is *definitely* something that I had hoped MOBY could do (Daminan,
I would put this in the use-case category of "where MOBY could really
shine"), but I'm not sure if we are there yet. A couple of fully
fleshed-out use case SQL-like queries, with their solutions, would help
clarify whether there is something missing from MOBY that would get us
to this point.
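One way such a use case could be explored is sketched below, assuming
each service can be described as "takes object type X, returns object
type Y". The service names and types are hypothetical, and the search is
an ordinary breadth-first search, not anything MOBY currently provides.

```python
from collections import deque

# Hypothetical services: name -> (input object type, output object type).
SERVICES = {
    "term_to_products": ("go_term", "gene_product"),
    "product_to_sequence": ("gene_product", "sequence"),
    "blast_sequence": ("sequence", "blast_report"),
}


def plan_workflow(start, goal, services=SERVICES):
    """Breadth-first search for the shortest chain of services from start to goal.

    Returns a list of service names, or None if no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for name, (src, dst) in services.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None


print(plan_workflow("go_term", "blast_report"))
```

If a worked query cannot be planned this way, the gap (a missing service,
or a missing description of what a service transforms) is exactly the
"something missing" the use cases are meant to expose.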
M
Chris Mungall wrote:
> I may be off the mark here, haven't read the moby docs in detail, but I
> believe one issue MOBY doesn't seem to address is querying. This
> shortcoming is inherent in the prevailing object oriented architecture
> paradigm, rather than moby per se. Asking for objects by identifiers only
> gets you so far; most interesting bioinformatics involves querying.
>
> GO seems to be a common use case - I wrote an OO API for the GO database.
> It saves a lot of time in terms of the most common operations - fetch
> graph by GO ID, search for terms, search for annotated products etc. But
> eventually you want to start doing complex queries such as "find all
> mammalian gene products that are involved in process 'transcription' but
> are not transcription factors" or somesuch; this can either be done
> imperatively by the API client (inefficient, slow, cumbersome), or the API
> designer could provide the API client programmer with an ad hoc query
> language (which is silly - you may as well use SQL).
>
> I'm now of the opinion that domain-specific APIs are fundamentally limited
> (albeit convenient). Wrapping a declarative language with an imperative
> one seems like a step backwards.
>
> What about a distributed solution that presents heterogeneous databases as
> if they were a unified single database? There would be one universal API
> allowing querying through some declarative language. I would make a
> distinction between services (eg a sequence analysis service) and data
> sources (eg a sequence database). Services would be function calls within
> the declarative language. Proxy servers could easily import functions and
> data to avoid the overhead of distributed joins at the cost of data lag.
>
> The framework for this could be xml schema + xml query language (like
> DiGIR?), OWL/DAML+OIL plus associated query language, or a relational
> model plus SQL. The latter seems the most sensible - robust open source
> technology and (very importantly) a sound theoretical underpinning.
>
> The key elements are:
>
> - data exposure, via a universal data model; for example: relations, trees
> (xml). Objects are ill-suited due to the lack of strong theoretical
> underpinnings.
>
> - querying, via a universal, expressive, declarative language
>
> I vaguely remember a lot of talk about federated databases when I was
> starting out in bioinformatics. Nothing much came of this. I put this down
> to immature technology (no postgres and the other DBs cost $$$) and an
> influx of OO programmers which led to CORBA (myself included). Maybe it's
> time for a revival? All this stuff goes in circles anyway....
>
> Of course, one problem with a powerful declarative interface vs a weak
> imperative one is it's easy to launch server-killing queries. (But of
> course it would be possible to guard against this with some kind of
> anti-server-hogging daemon).
>
> As an example, the following query would find the swissprot sequence of
> all fly DNA Binding (including subtypes of DNA Binding) proteins and then
> blast them against nr:
>
> SELECT
>   sptr:seq.display_id,
>   sptr:seq.description,
>   local:myfilter(ncbi:blastall('blastp',
>                                '-filter SEG+XNU',
>                                'nr',
>                                sptr:seq.residues))
> FROM
>   sptr:seq NATURAL JOIN go:term NATURAL JOIN go:closure
>   NATURAL JOIN flybase:go_association
>   NATURAL JOIN flybase:gene_product NATURAL JOIN sptr:seq
> WHERE
>   go:term.name = 'DNA Binding'
>   AND
>   flybase:go_association.is_curated = TRUE
>   AND
>   flybase:gene_product.species = 'D melanogaster'
> WITH NAMESPACES
>   'http://www.ebi.ac.uk/sptr/moby' AS sptr,
>   'http://www.flybase.org/moby' AS flybase,
>   'http://www.geneontology.org/moby' AS go
> ;
>
> OK, it's a bit wordy, but it is a complex query - I personally would
> rather write a query like this over allowing an API to make all the
> important decisions (closure, evidence, species etc) for me. Plus, it's
> easy to see how you would extend this - say to get only fly proteins that
> are expressed at certain places/times.
>
> (It's a somewhat disingenuous example, as all the data is currently
> available in one tablespace in the GO warehouse database anyway, or in
> flybase)
>
> (There is also a slight cheat involving the transitive closure table in
> the above example - this couldn't be done purely with natural joins)
>
> If the above query is too slow, I can easily build my own local warehouse
> copy of the table I need like this:
>
> INSERT INTO local:seq AS SELECT * FROM sptr:seq;
>
> And I can also just compile the 'blastall' function into my local copy of
> postgres.
>
> And if I really want to have a simple API wrapper that cans common queries
> I can still do it - the difference is the API isn't exposed across the
> wire, only the SQL+relations (or xml-queries + xml, or S-expressions +
> lisp functions) are.
>
> I know this all sounds very retro and doesn't take into account all the
> latest SOAP + EJB stack type technology, but we don't have to follow the
> market-led software engineering herd into every dreadful
> committee-designed anacronymistic technology.
>
> The hard part is optimising distributed joins, but I imagine this has been
> solved over and over in various CS projects, it's just a question of
> waiting until this makes its way into existing robust dbs like postgres.
> (postgres already has nice extensible functions, so the above example is
> I think do-able on a purely local installation).
>
> Am I way out on a limb here, is there any room for this sort of thing in
> the bioMOBY world?
>
> --
> chris
>
> _______________________________________________
> moby-l mailing list
> moby-l at biomoby.org
> http://biomoby.org/mailman/listinfo/moby-l
>
--
--------------------------------
"Speed is subsittute fo accurancy."
________________________________
Dr. Mark Wilkinson, RA Bioinformatics
National Research Council, Plant Biotechnology Institute
110 Gymnasium Place, Saskatoon, SK, Canada
phone : (306) 975 5279
pager : (306) 934 2322
mobile: markw_mobile at illuminae dot com