O/R mapping [was Re: [Bioperl-l] pipeline]

Chris Mungall cjm@fruitfly.bdgp.berkeley.edu
Wed, 13 Mar 2002 11:27:37 -0800 (PST)


On Wed, 13 Mar 2002, Elia Stupka wrote:

> I refrained from replying too early, but I guess I wasn't clear, I never
> imagined to get auto-generated code that actually does what I want it to
> do, I just meant that it's really nice (as David was saying) just to cut
> the silly/boring initial development time, e.g. all get/sets,etc.

yup, i think i was overly negative in my appraisal of O/R tools. The best
ones out there are ones like David Block's Genquire and Tangram that take
care of the really mundane stuff and leave the coder to tweak the slightly
mundane stuff. It's still not ideal but it's about the best way we have of
working in the object-centric mode right now.

> > Riffing off of what Imre has been doing, I do think that automatic mapping
> > (relational, xml etc) *does* make a lot of sense when you use an ontology
> > language to model your data.
> 
> Chris and ontology, love at first sight... :)
> 
> Any nice book/pages to read about ontology languages,etc? I'd like to
> understand more about them...

I guess www.daml.org is as good a place as any for a quick start

If you have the time, a better start would be a decent modern textbook on
AI. Can't recommend one, all mine are gathering dust.

The Ullman textbook on databases has a little section on datalog, see
below.
 
> > One thing that's different now is that now with postgres and soon with
> > mysql4.1 we have decent ways of doing proper relational stuff without
> > paying corporate bucks.
> 
> So does mysql 4.1 support views and other funky stuff?

Supposedly it will, yup.
 
> > Another cool thing to explore is a predicate-logic interpretation of the
> > relational data, eg datalog, but that's another tangent.
> 
> Ah uh, in english?

Ummm, sorry, rabbiting on about prolog again. I'll be on about how lisp
s-exprs are better than xml and betraying my woolly AI background before
long.

Ok, this is going /way/ off whatever the original topic was...

Basically, tuples (rows) can be thought of as extensional predicates in
first order predicate logic. An example of a predicate, expressed in
prolog, is:

man(socrates).            

(i.e. socrates is a man)

Now, if you add the ability to have intensional predicates (which you
can't do in most relational systems, but can be easily done with a
combination of a db and a prolog engine, i.e. datalog)

mortal(X):- man(X).      

(i.e. forall X, if X is a man, then X is mortal)

In prolog this is called a "horn clause".

So when you ask the system who is mortal...

? mortal(X).

It tells you

X = socrates

Woohoo!

note that if I ask

? mortal(elia).

then it will say "no". that's the closed world assumption for you.
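
For anyone without a prolog engine to hand, the same idea can be roughed
out in Python. This is only a sketch to show the mechanics: the names
`facts`, `apply_rule`, `holds` and `query` are made up for this example, and
it is nowhere near a real datalog engine.

```python
# Extensional predicates: facts stored as (predicate, argument) tuples,
# much like rows in a table.
facts = {("man", "socrates")}


def apply_rule(facts):
    """Intensional predicate: mortal(X) :- man(X)."""
    derived = {("mortal", x) for (pred, x) in facts if pred == "man"}
    return facts | derived


def holds(pred, arg, facts):
    """Closed world assumption: anything we cannot derive counts as false."""
    return (pred, arg) in apply_rule(facts)


def query(pred, facts):
    """Return every X for which pred(X) can be derived."""
    return sorted(x for (p, x) in apply_rule(facts) if p == pred)


print(query("mortal", facts))          # ['socrates']
print(holds("mortal", "elia", facts))  # False
```

The "no" for elia falls out of the set lookup: there is no man(elia) fact,
so nothing derives mortal(elia), and under the closed world assumption that
is treated as false rather than unknown.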

Now, what the bleedin ell does this have to do with bioperl/ensembl, you
may be asking, (if you've actually read this far)?

Well it turns out that this is a pretty smart way of specifying constraints,
queries and transformations.

It's especially useful for datasets that involve recursion, such as
anything involving concept hierarchies eg GO. For instance, we can define
transitive relationships like this -

statement(X, 'isa*', Y):- statement(X, 'isa', Y).
statement(X, 'isa*', Y):- statement(X, 'isa', Z), statement(Z, 'isa*', Y).

The equivalent code in the GO perl API isn't quite so concise.
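
For comparison, here is a sketch of the same two 'isa*' rules written as a
recursive SQL query, run through Python's sqlite3 module. Assumptions: an
SQLite build with recursive CTE support, a hypothetical three-column
`statement` table, and toy term names that are not real GO terms. It is not
how the GO perl API does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (subj TEXT, rel TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO statement VALUES (?, 'isa', ?)",
    [("kinase", "enzyme"), ("enzyme", "protein"), ("protein", "molecule")],
)


def ancestors(conn, term):
    """Return every Y such that statement(term, 'isa*', Y) holds."""
    sql = """
    WITH RECURSIVE isa_star(subj, obj) AS (
        -- rule 1: every direct isa link is also an isa* link
        SELECT subj, obj FROM statement WHERE rel = 'isa'
        UNION
        -- rule 2: X isa Z and Z isa* Y gives X isa* Y
        SELECT s.subj, c.obj
        FROM statement AS s JOIN isa_star AS c ON s.obj = c.subj
        WHERE s.rel = 'isa'
    )
    SELECT obj FROM isa_star WHERE subj = ? ORDER BY obj
    """
    return [row[0] for row in conn.execute(sql, (term,))]


print(ancestors(conn, "kinase"))  # ['enzyme', 'molecule', 'protein']
```

UNION (rather than UNION ALL) drops duplicate rows, so the recursion stops
once no new pairs turn up, even if the data contains a cycle.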

Or let's say we want to query ensembl for nested exons; assuming a basic
gff type relation of seq/type/start/end/id for simplicity:

% infer introns from exons
gff(Seq, 'intron', Start, End, 'anon'):-
	gff(Seq, 'exon', _, Start, Exon1),
	gff(Seq, 'exon', End, _, Exon2),
	pair(Exon1, Exon2).

% for all X, if X is an exon, and X is subsumed by an intron
% then X is a nested exon
intronic_exon(IntronicExon):-
	gff(Seq, 'exon', ExStart, ExEnd, IntronicExon),
	gff(Seq, 'intron', InStart, InEnd, _),
	ExStart > InStart,
	ExEnd   < InEnd.

(Needs corrected for strandedness and off-by-ones)

It turns out that prolog clauses with certain properties (roughly: no
recursion and no function symbols) have the same expressive power as SQL,
and can be mapped easily and efficiently (the above could be done with sql
views). With more expressive clauses, the logic engine will have to do the
extra work.
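
To make the sql views remark concrete, here is a sketch of the two exon
rules as views, again via Python's sqlite3. The table and column names
(`gff`, `pair`, `start_pos`, `end_pos`) and the coordinates are assumptions
for illustration, not the ensembl schema. Unlike the prolog version, the
inferred introns live in their own view instead of being added back into
the gff relation, and the same caveats about strand and off-by-ones apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gff (seq TEXT, type TEXT, start_pos INTEGER, end_pos INTEGER, id TEXT);
-- hypothetical relation listing consecutive exons of one transcript
CREATE TABLE pair (exon1 TEXT, exon2 TEXT);

-- infer introns from exons: end of exon1 up to start of exon2
CREATE VIEW intron AS
    SELECT e1.seq AS seq, e1.end_pos AS start_pos, e2.start_pos AS end_pos
    FROM gff AS e1
    JOIN pair AS p ON p.exon1 = e1.id
    JOIN gff AS e2 ON p.exon2 = e2.id AND e2.seq = e1.seq
    WHERE e1.type = 'exon' AND e2.type = 'exon';

-- an exon strictly inside some intron on the same sequence
CREATE VIEW intronic_exon AS
    SELECT DISTINCT ex.id AS id
    FROM gff AS ex
    JOIN intron AS i ON i.seq = ex.seq
    WHERE ex.type = 'exon'
      AND ex.start_pos > i.start_pos
      AND ex.end_pos < i.end_pos;
""")

conn.executemany(
    "INSERT INTO gff VALUES (?, ?, ?, ?, ?)",
    [
        ("chr1", "exon", 100, 200, "e1"),
        ("chr1", "exon", 900, 1000, "e2"),
        ("chr1", "exon", 400, 500, "nested"),
    ],
)
conn.execute("INSERT INTO pair VALUES ('e1', 'e2')")

nested = [row[0] for row in conn.execute("SELECT id FROM intronic_exon ORDER BY id")]
print(nested)  # ['nested']
```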

This could be used for - populating denormalised tables in the lite
schema, data sanity checks/constraints, complex queries. Or just for fun.

I have a feeling it will be useful for managing data of a comparative
nature. Excuses for introducing trees, graphs, fun things like that. Once
we have some decent comparative data for drosophila - v soon - I may
actually have to back up this statement.

Also very powerful for augmenting ontologies (although the standard way to
do that is with more restrictive description logics like OIL, which have
more guarantees on returning with an answer this side of the next ice
age). So if you recast the ensembl object model as DAML+OIL (think UML on
steroids for the s/w engineering inclined) you can attach very powerful
constraints and transformations making your data rock solid. e.g. making
sure every confirmed gene has a certain kind of evidence. The nearest UML
has is OCL (Object Constraint Language) which I don't know much about but
looks a bit crap.
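
As a rough idea of what such a constraint amounts to, here is the
"every confirmed gene has a certain kind of evidence" check as plain Python.
This is not DAML+OIL or OCL, just a stand-in to show the shape of the rule;
the gene names, evidence kinds and the `violations` helper are all
hypothetical.

```python
# gene id -> status (made-up data)
genes = {"geneA": "confirmed", "geneB": "confirmed", "geneC": "predicted"}

# (gene id, kind of evidence) pairs (made-up data)
evidence = {("geneA", "cDNA"), ("geneC", "genscan")}

# kinds of evidence we assume are good enough to confirm a gene
ACCEPTED = {"cDNA", "EST"}


def violations(genes, evidence, accepted):
    """violation(G) :- confirmed(G), not accepted evidence for G."""
    supported = {g for (g, kind) in evidence if kind in accepted}
    return sorted(
        g for g, status in genes.items()
        if status == "confirmed" and g not in supported
    )


print(violations(genes, evidence, ACCEPTED))  # ['geneB']
```

An empty list means the constraint holds; anything returned is a gene that
claims to be confirmed without the evidence to back it up.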

It's all logical, so it has a Mr.Spock-like appeal.

If anyone is remotely inclined after all that -

XSB - xsb.sourceforge.net - has ODBC bindings which work with postgres
SWI-Prolog - www.swi-prolog.org - postgres bindings are available

Both open source, and have good java/C bindings (but not perl alas)

> Sorry for being so honestly ignorant with my three questions above ;)

That's ok, it's quite possibly all completely irrelevant anyway.


And before someone tells me... yes, it is the data we care about here, not
the technology. I'm only getting into all this stuff now as I strongly
feel there has to be a better way of democratizing access to this data, of
giving people powerful ways of querying without having to read enormous
specs etc....
 
> Elia
> 
> -- 
> ********************************
> * http://www.fugu-sg.org/~elia *
> * tel:    +65 874 1467         *
> * mobile: +65 90307613         *
> * fax:    +65 777 0402         *
> ********************************