[MOBY-l] Re: Genomic position-based GO search...

Fri Nov 22 05:32:19 UTC 2002

[I'm reposting this for Chris, who was having problems getting things sent
to the Moby-l list; apologies if it appears twice as a result!]

> Hi Chris
>
> Your idea for #1 is intriguing and probably raises many questions in
> and of itself. Certainly the nice feature of GO (e.g. AmiGO) is that
> you can run cross-organism queries in one place and this integration
> using the GO IDs as the primary key works very well. I guess I
> ultimately see the various databases we run as essentially being tables
> in some uber-database, connected together by our common primary keys
> (accession numbers), the stumbling block currently being how to run the
> queries (we don't know what's in the tables or what the keys between the
> tables are). The question becomes how to achieve something like that -
> by having everything in one place and forced into a common structure to
> define the tables and keys or by distributing the components and having
> them connected by common protocols providing a level of abstraction
> between the query and the underlying structure.
>
> I'm not sure I fully understand what you mean in #2, that the approach
> doesn't scale - do you mean that we'd be trying to do too much in the
> 'transform' step of the process and we'd end up writing lots of
> different APIs to handle slight differences in the query? Perhaps you
> could expand on your experience with the GO API, as this might help
> others understand the practical limitations of these things? Naively I
> can't see any reason why such a service couldn't be written, but I don't
> have the practical experience to say why it would be a bad idea.

Chris writes:

The GO API allows you to do stuff like

fetch term by GO ID

fetch terms by search string

fetch products by GO ID (including subtypes)

fetch graph around GO ID

fetch GO ID by product ID

and so on. This is very useful and nice.
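
To make the list above concrete, here is a minimal sketch in Python of what
such fetch-style calls might look like. It is not the real GO API: the class
ToyGO, its method names, and the term and product IDs are all invented for
illustration, and the "database" is just a few in-memory dicts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGO:
    """In-memory stand-in for a GO database (hypothetical; all data made up)."""
    names: dict = field(default_factory=dict)        # term id -> term name
    parents: dict = field(default_factory=dict)      # term id -> set of parent ids
    annotations: dict = field(default_factory=dict)  # product id -> set of term ids

    def term_by_id(self, go_id):
        return self.names[go_id]

    def terms_by_search(self, text):
        """Case-insensitive substring match on term names."""
        return sorted(t for t, n in self.names.items() if text.lower() in n.lower())

    def subtypes(self, go_id):
        """Return go_id plus every term below it in the graph."""
        found = {go_id}
        frontier = [go_id]
        while frontier:
            current = frontier.pop()
            for child, child_parents in self.parents.items():
                if current in child_parents and child not in found:
                    found.add(child)
                    frontier.append(child)
        return found

    def products_by_id(self, go_id, include_subtypes=True):
        wanted = self.subtypes(go_id) if include_subtypes else {go_id}
        return sorted(p for p, terms in self.annotations.items() if terms & wanted)

    def graph_around(self, go_id):
        """Direct parents and direct children of a term."""
        children = {c for c, ps in self.parents.items() if go_id in ps}
        return {"parents": set(self.parents.get(go_id, ())), "children": children}

    def ids_by_product(self, product_id):
        return sorted(self.annotations.get(product_id, ()))


# Toy data: "GO:TOY2" is a subtype of "GO:TOY1".
go = ToyGO(
    names={"GO:TOY1": "binding", "GO:TOY2": "DNA binding", "GO:TOY3": "transport"},
    parents={"GO:TOY2": {"GO:TOY1"}},
    annotations={"geneA": {"GO:TOY2"}, "geneB": {"GO:TOY1"}, "geneC": {"GO:TOY3"}},
)
```

With this toy data, go.products_by_id("GO:TOY1") picks up geneA through the
subtype as well as geneB, while passing include_subtypes=False returns only
geneB.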

Then later on you start wanting more fine-grained control: filter by
evidence code, filter by species/source. You also want control over how
much data is fetched in one API call vs. how much to leave as stubs to be
lazy-loaded, and over whether or not to follow the closure of the graph.
As these are added, the API accumulates more abstruse methods, more
optional parameters, more pragmas.
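
A rough sketch of how that accumulation tends to look, again in Python and
again hypothetical: fetch_products, its keyword arguments, and the toy
annotations are assumptions made for illustration, not the actual GO API.
Only the evidence codes (IDA, IEA) are real GO codes.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    product: str
    term: str
    evidence: str  # GO evidence code, e.g. "IDA" or "IEA"
    species: str


# Toy data: term ids, products and species assignments are invented.
CHILDREN = {"GO:TOY1": ["GO:TOY2"], "GO:TOY2": []}
ANNOTATIONS = [
    Annotation("geneA", "GO:TOY2", "IDA", "fly"),
    Annotation("geneB", "GO:TOY1", "IEA", "fly"),
    Annotation("geneC", "GO:TOY1", "IDA", "mouse"),
]


def closure(term):
    """The term plus all of its descendants."""
    out = [term]
    for child in CHILDREN.get(term, []):
        out.extend(closure(child))
    return out


def fetch_products(term, evidence=None, species=None,
                   follow_closure=True, stubs_only=False):
    """Each keyword argument is one more constraint ANDed onto the query."""
    terms = closure(term) if follow_closure else [term]
    hits = [a for a in ANNOTATIONS
            if a.term in terms
            and (evidence is None or a.evidence == evidence)
            and (species is None or a.species == species)]
    # stubs_only returns bare product ids, leaving the details to be
    # loaded later by some other call.
    return [a.product for a in hits] if stubs_only else hits
```

Every new requirement becomes another keyword argument, and every argument
can only narrow the result; nothing in this signature can express "or" or
"not".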

All these query constraints are ANDed together. Eventually you get to the
point where you want to do more complex boolean queries (e.g. all products
of function X not of process Y). These haven't been implemented yet,
partly because doing so would require bolting an ad-hoc query language
onto the API, at which point the benefit of having an API, weighed against
the hassle, becomes vanishingly small. And this is just for GO, which is
incredibly simple. For data richer than GO or DAS it starts getting silly.
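
As an illustration of the "function X not of process Y" case, the sketch
below (Python, hypothetical names and toy data throughout) shows the kind
of composable predicates such a query needs. Note that has, and_, or_ and
not_ already amount to a small query language sitting on top of the fetch
calls, which is the point being made above.

```python
# Toy annotations: product -> set of term ids (all invented).
ANNOTATIONS = {
    "geneA": {"FUNC:X", "PROC:Y"},
    "geneB": {"FUNC:X"},
    "geneC": {"PROC:Y"},
}


def has(term):
    """Predicate: the product is annotated to this term."""
    return lambda terms: term in terms


def and_(p, q):
    return lambda terms: p(terms) and q(terms)


def or_(p, q):
    return lambda terms: p(terms) or q(terms)


def not_(p):
    return lambda terms: not p(terms)


def select(predicate):
    """Return the products whose annotation set satisfies the predicate."""
    return sorted(prod for prod, terms in ANNOTATIONS.items() if predicate(terms))


# "All products of function X not of process Y"
query = and_(has("FUNC:X"), not_(has("PROC:Y")))
```

Here select(query) returns only geneB, since geneA is also annotated to
process Y and geneC lacks function X.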