[MOBY-l] Queries in bioMOBY
Lukas Mueller
mueller at acoma.stanford.edu
Fri Nov 8 17:53:25 UTC 2002
Hi,
I recently heard a talk at a Bay Area bioinformatics meeting about a
company that has implemented exactly what Chris is talking about. I was
a bit skeptical of their implementation. What they do is develop
wrappers around web pages and query them as web services, presenting to
the user what looks like a big SQL database. The company is at
www.geneticXchange.com. Unfortunately, they don't have much more
information on the site than the (I think) CEO presented in the talk. He
also gave a demo, and the performance seemed quite good. It must be a
nightmare to maintain these wrappers, though: as web pages change, the
wrappers have to be adapted. They have about 60 wrappers developed. They
really seem to be a poor man's ad hoc MOBY, so I think that such a
system would be better built on top of something like MOBY. Of course,
the system on top of the wrappers must be pretty neat.
Cheers
Lukas
On Friday, November 8, 2002, at 07:31 , Mark Wilkinson wrote:
> Hi Chris,
>
> I chuckled as I read your message, as I wrote an almost identical SQL
> query a couple of months ago in a discussion of my vision of MOBY with
> some of the Canadian Bioinformatics Integration Network participants :-)
>
> Yes, I agree that this query language would be an extremely powerful
> tool, but I don't think that your vision and my vision are at odds
> with each other - in fact, in some ways they complement each other.
> Queries (at the level you are describing) are not something that I had
> ever envisioned would be explicitly a part of MOBY itself, but rather
> one of the tools that might be built on top of it. E.g. you explicitly
> say "FROM flybase.go_accession" whereas in the MOBY world the
> location(s) of go_accessions would be discovered for you... so the
> query would merely say "FROM go_accession" or some such thing.
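
The discovery step described above could be sketched roughly as follows.
This is only an illustration: the REGISTRY dictionary and the
resolve_sources helper are hypothetical names made up for this note, not
part of any MOBY API, and the provider lists are invented.

```python
# Hypothetical registry: maps an unqualified table name to the providers
# that have registered data of that type. The contents are invented.
REGISTRY = {
    "go_accession": ["flybase", "sptr"],
    "gene_product": ["flybase"],
}


def resolve_sources(table_name, registry=REGISTRY):
    """Expand an unqualified table name into provider-qualified names."""
    providers = registry.get(table_name)
    if not providers:
        raise LookupError(f"no registered provider for {table_name!r}")
    return [f"{provider}.{table_name}" for provider in providers]


# "FROM go_accession" would be expanded by the query layer, not the user.
sources = resolve_sources("go_accession")
```

A query engine built this way would then decide which of the resolved
sources to contact, which is exactly the part the user no longer has to
spell out.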
>
> So... yeah... I think we are talking about different "layers", where
> your query language would be sitting on top of MOBY. In addition,
> there is now no doubt at all in my mind that we want (need?) a metadata
> layer on top of MOBY (to more fully describe the transformations
> occurring in the various services), but not necessarily a part of MOBY
> itself, and I see your query-engine sitting perhaps even on top of that
> metadata layer to make the semantic joins that I think you are
> describing in your example.
>
> We probably do lose *some* power in our object-oriented approach, but I
> doubt that we lose as much as you are suggesting since our objects are
> supposed to be so lightweight that they represent only one or two (as
> few as possible) database fields per object, and these are preferably
> just representing database "keys". In this sense, your query language
> should be quite happy on top of MOBY, since your query is basically
> joining tables on their (semantic) foreign keys... Tackling the
> problem in this way also seems to solve one of your concerns - that
> "killer queries" probably don't have their killer computation done on
> the database server, but rather in the machine running the query
> engine, since the joins are not done on the server per se... though
> admittedly, I haven't thought through this problem as well as you have
> so that might be rubbish-talk.
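
To make the "joins happen in the query engine, not on the server" idea
concrete, here is a minimal sketch of a client-side hash join. It assumes
each service simply returns lightweight records carrying a shared key;
the record layouts and identifiers below are invented for illustration.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Join two lists of dicts on a shared key, entirely on the client."""
    # Index one side by the join key so each lookup is O(1).
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical payloads standing in for what two separate services return.
go_associations = [
    {"gene_product_id": "gp1", "go_id": "GO:A"},
    {"gene_product_id": "gp2", "go_id": "GO:B"},
]
gene_products = [
    {"gene_product_id": "gp1", "species": "D melanogaster"},
    {"gene_product_id": "gp3", "species": "D melanogaster"},
]

joined = hash_join(go_associations, gene_products, "gene_product_id")
```

The servers only ever answer simple key lookups here; the cost of the
join lands on the machine running the query engine.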
>
> It might be an idea, as a MOBY use case, to take a query like Chris has
> just provided and see if we can generate a workflow that solves it.
> This is *definitely* something that I had hoped MOBY could do (Damian,
> I would put this in the use-case category of "where MOBY could really
> shine"), but I'm not sure if we are there yet. A couple of fully
> fleshed-out use case SQL-like queries, with their solutions, would help
> clarify whether there is something missing from MOBY that would get us
> to this point.
>
> M
>
>
> Chris Mungall wrote:
>> I may be off the mark here, haven't read the moby docs in detail, but I
>> believe one issue MOBY doesn't seem to address is querying. This
>> shortcoming is inherent in the prevailing object oriented architecture
>> paradigm, rather than moby per se. Asking for objects by identifiers
>> only gets you so far; most interesting bioinformatics involves
>> querying.
>> GO seems to be a common use case - I wrote an OO API for the GO
>> database. It saves a lot of time in terms of the most common
>> operations - fetch graph by GO ID, search for terms, search for
>> annotated products etc. But eventually you want to start doing complex
>> queries such as "find all mammalian gene products that are involved in
>> process 'transcription' but are not transcription factors" or
>> somesuch; this can either be done imperatively by the API client
>> (inefficient, slow, cumbersome), or the API designer could provide the
>> API client programmer with an ad hoc query language (which is silly -
>> you may as well use SQL).
>> I'm now of the opinion that domain-specific APIs are fundamentally
>> limited (albeit convenient). Wrapping a declarative language with an
>> imperative one seems like a step backwards.
>> What about a distributed solution that presents heterogeneous
>> databases as if they were a unified single database? There would be
>> one universal API allowing querying through some declarative language.
>> I would make a distinction between services (eg a sequence analysis
>> service) and data sources (eg a sequence database). Services would be
>> function calls within the declarative language. Proxy servers could
>> easily import functions and data to avoid the overhead of distributed
>> joins at the cost of data lag.
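
"Services as function calls within the declarative language" can be
tried out locally with a small sketch. This uses Python's sqlite3 module
as a stand-in engine (an assumption made for this note; the thread itself
talks about postgres), and gc_content is a toy function invented here to
play the role of an analysis service.

```python
import sqlite3


def gc_content(residues):
    """Toy 'analysis service': fraction of G and C in a nucleotide string."""
    if not residues:
        return 0.0
    return sum(1 for base in residues.upper() if base in "GC") / len(residues)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name gc_content (one argument).
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE seq (display_id TEXT, residues TEXT)")
conn.executemany(
    "INSERT INTO seq VALUES (?, ?)",
    [("s1", "GGCC"), ("s2", "ATGC")],
)

# The service is now just another function call inside the query.
rows = conn.execute(
    "SELECT display_id, gc_content(residues) FROM seq ORDER BY display_id"
).fetchall()
```

A remote service such as blastall would need the same kind of
registration, with the function body making the network call instead of
computing locally.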
>> The framework for this could be xml schema + xml query language (like
>> DiGIR?), OWL/DAML+OIL plus associated query language, or a relational
>> model plus SQL. The latter seems the most sensible - robust open source
>> technology and (very importantly) a sound theoretical underpinning.
>> The key elements are:
>> - data exposure, via a universal data model; for example: relations,
>>   trees (xml). Objects are ill-suited due to the lack of strong
>>   theoretical underpinnings.
>> - querying, via a universal, expressive, declarative language
>> I vaguely remember a lot of talk about federated databases when I was
>> starting out in bioinformatics. Nothing much came of this. I put this
>> down to immature technology (no postgres and the other DBs cost $$$)
>> and an influx of OO programmers which led to CORBA (myself included).
>> Maybe it's time for a revival? All this stuff goes in circles
>> anyway....
>> Of course, one problem with a powerful declarative interface vs a weak
>> imperative one is it's easy to launch server-killing queries. (But of
>> course it would be possible to guard against this with some kind of
>> anti-server-hogging daemon).
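
One possible shape for such a guard, sketched with Python's sqlite3
module as a stand-in engine (an assumption for this note, not something
proposed in the thread): a progress handler that aborts any query running
past a deadline. The run_with_time_limit helper is a hypothetical name.

```python
import sqlite3
import time


def run_with_time_limit(conn, sql, max_seconds):
    """Run a query, aborting it if it exceeds max_seconds of wall time."""
    deadline = time.monotonic() + max_seconds
    # The handler is polled every 1000 virtual-machine instructions;
    # returning a non-zero value makes SQLite interrupt the query.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {max_seconds}s") from exc
        raise  # some other SQL error, not a timeout
    finally:
        conn.set_progress_handler(None, 1)  # remove the guard again
```

A real server would also want per-user quotas and row limits; this only
shows that the engine itself can enforce a time budget.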
>> As an example, the following query would find the swissprot sequence of
>> all fly DNA Binding (including subtypes of DNA Binding) proteins and
>> then blast them against nr:
>> SELECT
>>   sptr:seq.display_id,
>>   sptr:seq.description,
>>   local:myfilter(ncbi:blastall('blastp',
>>                                '-filter SEG+XNU',
>>                                'nr',
>>                                sptr:seq.residues))
>> FROM
>>   sptr:seq NATURAL JOIN go:term NATURAL JOIN go:closure
>>   NATURAL JOIN flybase:go_association
>>   NATURAL JOIN flybase:gene_product NATURAL JOIN sptr:seq
>> WHERE
>>   go:term.name = 'DNA Binding'
>>   AND
>>   flybase:go_association.is_curated = TRUE
>>   AND
>>   flybase:gene_product.species = 'D melanogaster'
>> WITH NAMESPACES
>>   'http://www.ebi.ac.uk/sptr/moby' AS sptr,
>>   'http://www.flybase.org/moby' AS flybase,
>>   'http://www.geneontology.org/moby' AS go
>> ;
>> OK, it's a bit wordy, but it is a complex query - I personally would
>> rather write a query like this over allowing an API to make all the
>> important decisions (closure, evidence, species etc) for me. Plus, it's
>> easy to see how you would extend this - say to get only fly proteins
>> that are expressed at certain places/times.
>> (It's a somewhat disingenuous example, as all the data is currently
>> available in one tablespace in the GO warehouse database anyway, or in
>> flybase)
>> (There is also a slight cheat involving the transitive closure table in
>> the above example - this couldn't be done purely with natural joins)
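
For what it's worth, the closure itself can be computed in the query
language rather than precomputed, using a recursive common table
expression. The sketch below uses Python's sqlite3 module as a stand-in
engine; the term and is_a tables and their contents are invented for
illustration and do not reflect the real GO schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE is_a (child_id TEXT, parent_id TEXT);
INSERT INTO term VALUES
    ('T1', 'DNA Binding'),
    ('T2', 'transcription factor'),
    ('T3', 'zinc finger TF'),
    ('T4', 'kinase');
INSERT INTO is_a VALUES ('T2', 'T1'), ('T3', 'T2');
""")

# Reflexive transitive closure: every term is its own descendant, and each
# recursive step follows one more is_a edge downwards.
CLOSURE_SQL = """
WITH RECURSIVE closure(ancestor_id, descendant_id) AS (
    SELECT id, id FROM term
    UNION
    SELECT closure.ancestor_id, is_a.child_id
    FROM closure JOIN is_a ON is_a.parent_id = closure.descendant_id
)
SELECT d.name
FROM closure
JOIN term AS a ON a.id = closure.ancestor_id
JOIN term AS d ON d.id = closure.descendant_id
WHERE a.name = 'DNA Binding'
ORDER BY d.name
"""

names = [row[0] for row in conn.execute(CLOSURE_SQL)]
```

Whether a distributed engine could evaluate something like this
efficiently across servers is a separate question; a materialised closure
table avoids it.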
>> If the above query is too slow, I can easily build my own local
>> warehouse copy of the table I need like this:
>> INSERT INTO local:seq SELECT * FROM sptr:seq;
>> And I can also just compile the 'blastall' function into my local copy
>> of postgres.
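
The local warehouse copy can be mimicked in a small sketch, again using
Python's sqlite3 module as a stand-in engine (an assumption for this
note). The "main" database plays the role of the remote source and an
attached database named warehouse plays the local copy; both names and
the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second, separate in-memory database standing in for the local warehouse.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# "main" stands in for the remote source in this sketch.
conn.execute("CREATE TABLE main.seq (display_id TEXT, residues TEXT)")
conn.executemany(
    "INSERT INTO main.seq VALUES (?, ?)",
    [("s1", "MKV"), ("s2", "MAL")],
)

# Copy the whole table across schemas in one declarative statement.
conn.execute("CREATE TABLE warehouse.seq AS SELECT * FROM main.seq")

copied = conn.execute(
    "SELECT display_id FROM warehouse.seq ORDER BY display_id"
).fetchall()
```

The copy is a snapshot, so it carries the data-lag trade-off mentioned
earlier in the message.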
>> And if I really want to have a simple API wrapper that cans common
>> queries I can still do it - the difference is the API isn't exposed
>> across the wire, only the SQL+relations (or xml-queries + xml, or
>> S-expressions + lisp functions) are.
>> I know this all sounds very retro and doesn't take into account all the
>> latest SOAP + EJB stack type technology, but we don't have to follow
>> the market-led software engineering herd into every dreadful
>> committee-designed anacronymistic technology.
>> The hard part is optimising distributed joins, but I imagine this has
>> been solved over and over in various CS projects; it's just a question
>> of waiting until this makes its way into existing robust dbs like
>> postgres. (postgres already has nice extensible functions, so the above
>> example is, I think, do-able on a purely local installation.)
>> Am I way out on a limb here, is there any room for this sort of thing
>> in the bioMOBY world?
>> --
>> chris
>> _______________________________________________
>> moby-l mailing list
>> moby-l at biomoby.org
>> http://biomoby.org/mailman/listinfo/moby-l
>
>
> -- --------------------------------
> "Speed is subsittute fo accurancy."
> ________________________________
>
> Dr. Mark Wilkinson, RA Bioinformatics
> National Research Council, Plant Biotechnology Institute
> 110 Gymnasium Place, Saskatoon, SK, Canada
>
> phone : (306) 975 5279
> pager : (306) 934 2322
> mobile: markw_mobile at illuminae dot com
>
>