[Biojava-l] Re: persistence - and the problems with it (Gerald Loeffler)

Aaron Kitzmiller AKitzmiller@genetics.com
Wed, 03 May 2000 14:50:05 -0400


Sorry, folks, I've been in meetings all day and am just now catching up.

Many of the disadvantages raised are legitimate points.  Hand-writing O/R mappings is tedious.  Working with relational database applications that don't use objects, however, is even more tedious and error-prone.

As Dr. Brocklehurst noted, the task still has to be done.  Large pharma and biotech companies are not moving to object databases any time soon, especially given the huge amount of infrastructure, tools and knowledge currently available.  In addition, we buy more databases than we build and the few bioinformatics software companies that have offered object database applications in the past are moving them to relational format.  I prefer the suggestion of creating a general, open-source O/R mapping tool.  In the absence of that, what we could provide is a set of interfaces and a few simple implementations that could work reasonably well under certain circumstances.  With proper use of interfaces, the home-grown code could be replaced with a commercial solution when possible.
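
To make the interface idea concrete, here is a rough sketch of the kind of thing I have in mind. It is only an illustration: the names (SequenceStore, InMemorySequenceStore, StoredSequence) are made up and are not part of biojava, and the in-memory map merely stands in for a JDBC-backed or commercial implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical value class for illustration; not a biojava type.
final class StoredSequence {
    final String id;
    final String residues;

    StoredSequence(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }
}

// Client code depends only on this interface, so the implementation
// behind it can be swapped (home-grown JDBC, commercial O/R tool, ...).
interface SequenceStore {
    void save(StoredSequence seq);
    Optional<StoredSequence> findById(String id);
}

// Simplest possible implementation: a map held in memory.
final class InMemorySequenceStore implements SequenceStore {
    private final Map<String, StoredSequence> byId = new HashMap<>();

    @Override
    public void save(StoredSequence seq) {
        byId.put(seq.id, seq);
    }

    @Override
    public Optional<StoredSequence> findById(String id) {
        return Optional.ofNullable(byId.get(id));
    }
}

class SequenceStoreDemo {
    public static void main(String[] args) {
        SequenceStore store = new InMemorySequenceStore();
        store.save(new StoredSequence("seq1", "ACGT"));
        // Look the sequence up again through the interface only.
        System.out.println(store.findById("seq1").map(s -> s.residues).orElse("not found"));
    }
}
```

Code written against SequenceStore would not need to change if the map were later replaced by something that talks to a real database.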

As mentioned, the commercially available tools are superior to naive, home-grown options in many ways; TOPLink, I know, permits object caching, and provides a GUI tool for doing the mappings.  I believe JavaBlend is improving, though I haven't revisited it in a while.  There is definitely a serious performance hit in the absence of such caching, though there are a number of ways to alleviate this problem.  

Why not use a third party O/R mapping tool?  Though it is certainly a matter of debate, I like to stay away from tools or libraries that have a steep learning curve, yet are not in some way standard.  A lot of the people employed in bioinformatics groups are not pure computer scientists, i.e. they have a biological science background.  It is difficult enough to move to object-oriented programming (which actually makes sense to a domain expert), without requiring knowledge of proprietary libraries.  If those libraries are very standard or are likely to be (JDBC, Swing, biojava itself), or represent well understood interfaces (jgl), it's not such a problem.  

I've used simple O/R mapping techniques in the past and there are some things you can do to lessen the performance hit.  First, it helps to use servlets for database access, especially if your web server and database server have a nice, fat pipe between them.  Second, lazy instantiation of the more complex elements can help a lot.  For example, a proxy object for sequence bases or features can make sequence objects easier to retrieve.  
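
As a sketch of the lazy instantiation idea (the names Bases and LazyBases are hypothetical, not biojava classes, and the Supplier stands in for whatever database call actually fetches the bases), a proxy might look something like this:

```java
import java.util.function.Supplier;

// Hypothetical interface for illustration; not the biojava API.
interface Bases {
    String residues();
}

// Proxy that defers the expensive load until the bases are first needed.
final class LazyBases implements Bases {
    private final Supplier<String> loader; // e.g. a database query
    private String cached;                 // null until first access

    LazyBases(Supplier<String> loader) {
        this.loader = loader;
    }

    @Override
    public synchronized String residues() {
        if (cached == null) {
            cached = loader.get(); // load once, on first use
        }
        return cached;
    }
}

class LazyBasesDemo {
    public static void main(String[] args) {
        int[] loads = {0};
        Bases bases = new LazyBases(() -> {
            loads[0]++;            // stands in for a round trip to the database
            return "ACGTACGT";
        });
        System.out.println("loads before access: " + loads[0]);
        bases.residues();
        bases.residues();
        System.out.println("loads after two accesses: " + loads[0]);
    }
}
```

A sequence object holding such a proxy can be built from a cheap query (id, name, length), and the bases are only fetched if somebody actually asks for them.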

Aaron K.

Aaron Kitzmiller
Manager Systems Development -Cambridge
Bioinformatics Department
35 Cambridge Park Dr.
Cambridge, MA 02140
Phone: (617) 665-6831
Fax: (617) 665-8870
Email: akitzmiller@genetics.com

>>> <biojava-l-admin@biojava.org> 05/03 12:00 PM >>>

Today's Topics:

  1. persistence - and the problems with it (Gerald Loeffler)
  2. Re: persistence - and the problems with it (Simon Brocklehurst)
  3. Re: persistence - and the problems with it (Thomas Down)
  4. Re: persistence - and the problems with it (Gerald Loeffler)

--__--__--

Message: 1
Date: Tue, 02 May 2000 23:58:13 +0200
From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Organization: Daemonstration Software Consulting
To: biojava-l@biojava.org 
CC: akitzmiller@genetics.com 
Subject: [Biojava-l] persistence - and the problems with it

Hi!

Let me comment on something that has been said on this list on the topic
of making BioJava objects persistent:

1) Java Serialisation is a very bad way of making objects persistent. It
is very slow; it leads to _enormously_ big data stores; there are
serious problems with preservation of object identity; it is almost
impossible to handle more objects than fit into main memory at any time;
it is not transactionally safe; it does not offer networked access; it
does not offer a way to query the persistent objects; and so on and so
forth. In other words: I've never heard of a serious project that used
Java Serialisation as the persistence mechanism. (Of course it's easy to
"serialise your Java objects away" - but in this way you cannot build
up a database of persisted objects!)

2) It has been said that "For large companies, object databases don't
make much sense". Oh well. Firstly I think that pure object databases
_do_ make sense - but I realize that this is a religious debate to some.
Much more important is this: To use an object-oriented API for
persistence does _not_ say that you need to use a pure object-oriented
database as your database backend - you _can_ but you need not! The
persistence API is one thing - the database is another thing: it may be
relational with an object-relational mapping tool on top; or it may be a
pure object database.

3) The only standardised API for transparent object-persistence from
Java to this date is the ODMG 3.0 Java binding (http://www.odmg.org).
(Its successor, Java Data Objects, is underway:
http://java.sun.com/products/jdbc/related.html). It offers a portable,
very natural (IMHO) way to make Java objects persistent and to query the
data store for objects with certain properties. Very few implementations
of this API exist to date. There _are_ however pure object databases as
well as object-relational mapping tools that support this API - i.e.,
you can use Oracle as your database backend if you really like.

4) Writing explicit code (using JDBC) to persist a complex network of
Java objects (an Alignment and its Sequences and all its Annotations and
Features and so on) into a relational database is _very_ tedious and
error-prone. I honestly can't imagine doing this for all the classes
(interfaces) in BioJava! Besides, unless you are clever with caching and
so on, your performance will be lousy (because you are triggering _a
lot_ of very small database requests - at least one for each (usually
very fine grained) object.) - and if you do clever caching, you are
essentially implementing your own object-relational mapping tool:

5) Object-relational mapping tools are very sophisticated software
products - hence their price tag. The good ones transparently and
efficiently map the Java-side onto the relational database-side and
vice-versa - i.e. they (automatically) generate and use a relational
database schema from your object model (.java files). They make all the
JDBC-coding unnecessary. They make your relational database look like an
object database (i.e. Oracle can suddenly be programmed using e.g. the
ODMG Java binding.) They cache your objects, preserve object identity
across distributed caches, know how to perform queries... I'd never
dream about starting a project to write such a beast from scratch when
there are quite a few companies who specialise in this...
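
To give an idea of the caching mentioned in 4) and 5), here is a very small sketch of an identity map, which is only one tiny piece of what a real mapping tool does. The class name IdentityMap is made up for this example, and the Function stands in for a JDBC query; no eviction, transactions or distribution are handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical identity map: one in-memory object per database key.
final class IdentityMap<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for a JDBC query
    private int loads = 0;

    IdentityMap(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key); // only hit the database on a miss
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    int loadCount() {
        return loads;
    }
}

class IdentityMapDemo {
    public static void main(String[] args) {
        IdentityMap<String, StringBuilder> features =
                new IdentityMap<>(id -> new StringBuilder("feature " + id));
        StringBuilder a = features.get("f1");
        StringBuilder b = features.get("f1");
        System.out.println(a == b);               // same object both times
        System.out.println(features.loadCount()); // loader ran once
    }
}
```

Asking twice for the same key gives back the very same object and costs one load, which is what takes the sting out of the many small requests.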

	sorry for the many words (-:
	gerald
-- 
   Gerald.Loeffler@vienna.at _________________ Software Architect
   http://www.imp.univie.ac.at ____ http://www.daemonstration.com 
   OOA&D, Java, J2EE, JSP, Servlets, JavaBeans, ODBMS, RDBMS, XML

--__--__--

Message: 2
Date: Wed, 03 May 2000 15:45:37 +0100
From: Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com>
Organization: Cambridge Antibody Technology Group plc
To: biojava-l@biojava.org 
CC: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Subject: Re: [Biojava-l] persistence - and the problems with it

Hi Gerald,

I thought you'd have caused more of a stir with that post - I certainly
enjoyed it! Seeing as no-one else has replied yet...

Gerald Loeffler wrote:

> Hi!
>
> Let me comment on something that has been said on this list on the topic
> of making BioJava objects persistent:
>
> 1) Java Serialisation is a very bad way of making objects persistent.

Agreed!  There are just sooooooo many bad things about Java serialization...

> 4) Writing explicit code (using JDBC) to persist a complex network of
> Java objects (an Alignment and its Sequences and all its Annotations and
> Features and so on) into a relational database is _very_ tedious and
> error-prone. I honestly can't imagine doing this for all the classes
> (interfaces) in BioJava!

Yes it's tedious, but whilst it's easy to make errors writing lots of
database calls by using JDBC, it's really a long way from being impossible
to do it correctly.

> Besides, unless you are clever with caching and
> so on, your performance will be lousy (because you are triggering _a
> lot_ of very small database requests - at least one for each (usually
> very fine grained) object.) - and if you do clever caching, you are
> essentially implementing your own object-relational mapping tool:

True - but I'm not exactly clear what you're trying to say here. I'm
definitely getting the impression that you personally don't want to do this
;-) But do you think other people:

  a) Shouldn't do it
  b) Can't do it
  c) Should go ahead and do it if they want to
  d) Should consider collaborating on a general open source
object-relational mapping tool, rather than writing something specific for
biojava.
  e) Do something else

> 5) Object-relational mapping tools are very sophisticated software
> products - hence their price tag. The good ones transparently and
> efficiently map the Java-side onto the relational database-side and
> vice-versa - i.e. they (automatically) generate and use a relational
> database schema from your object model (.java files). They make all the
> JDBC-coding unnecessary. They make your relational database look like an
> object database (i.e. Oracle can suddenly be programmed using e.g. the
> ODMG Java binding.) They cache your objects, preserve object identity
> across distributed caches, know how to perform queries... I'd never
> dream about starting a project to write such a beast from scratch when
> there are quite a few companies who specialise in this...

That's fine.  The only thing is, much of the potential user community of
Biojava may not have the budget to buy expensive Enterprise-class software.
So if persistence of objects is a goal of biojava, whatever the solution is
should probably not rely on costly infrastructure.

Are there any mature, high-quality, feature-rich tools that get you where
you need to go in terms of developing high-performance systems?  In your
experience what are the best commercial Java object-relational mapping
tools?  What are the benefits, if any, of the commercial tools over the free
tools?

Also which do you think is the best pure object-database for dealing with
Java objects?

You didn't discuss using XML representations of  biojava objects.  That
might offer a reasonable way to allow a wide variety of types of user to
exploit biojava. Once you have the XML you can do what you like with it...

What do you think?
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/ 
mailto:simon.brocklehurst@CambridgeAntibody.com 


--__--__--

Message: 3
Date: Wed, 3 May 2000 16:55:29 +0100
From: Thomas Down <td2@sanger.ac.uk>
To: Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com>
Cc: biojava-l@biojava.org, Gerald Loeffler <Gerald.Loeffler@vienna.at>
Subject: Re: [Biojava-l] persistence - and the problems with it
Organization: This tangled web on which I'm laid intwined

On Wed, May 03, 2000 at 03:45:37PM +0100, Simon Brocklehurst wrote:
> >
> > 1) Java Serialisation is a very bad way of making objects persistent.
> 
> Agreed!  There are just sooooooo many bad things about Java serialization...

Agreed up to a point.  It's certainly no panacea, but, to be fair,
it works pretty well for a lot of cases where you want short-term
persistence for data from simple programs.  (Hey, it's got me out
of trouble plenty of times...).  I'd like to see everything in
Java that /can/ reasonably be serialized marked as Serializable
(for a start, that allows distributed biojava apps using RMI).

Of course, this isn't a reason for not developing more sophisticated
persistence mechanisms for the cases where they're more appropriate.

> You didn't discuss using XML representations of  biojava objects.  That
> might offer a reasonable way to allow a wide variety of types of user to
> exploit biojava. Once you have the XML you can do what you like with it...

XML is probably my preferred method for a lot of long-term/cross-application
persistence functions.  BioJava is already using XML a little bit:
take a look at the XmlMarkovModel class.  I expect this will grow,
but when there isn't an existing XML grammar which fulfils the
requirements of a particular BioJava object, a bit of care is needed
to create a new grammar which will be widely accepted.

Thomas.
-- 
There are whose study is of smells
And to attentive schools rehearse
How something mixed with something else
Makes something worse.


--__--__--

Message: 4
Date: Wed, 03 May 2000 18:39:39 +0200
From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Reply-To: Gerald.Loeffler@vienna.at 
To: biojava-l@biojava.org 
CC: Thomas Down <td2@sanger.ac.uk>,
Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com>
Subject: Re: [Biojava-l] persistence - and the problems with it



Thomas Down wrote:
> 
> On Wed, May 03, 2000 at 03:45:37PM +0100, Simon Brocklehurst wrote:
> > >
> > > 1) Java Serialisation is a very bad way of making objects persistent.
> >
> > Agreed!  There are just sooooooo many bad things about Java serialization...
> 
> Agreed up to a point.  It's certainly no panacea, but, to be fair,
> it works pretty well for a lot of cases where you want short-term
> persistence for data from simple programs.  (Hey, it's got me out
> of trouble plenty of times...).  I'd like to see everything in
> Java that /can/ reasonably be serialized marked as Serializable
> (for a start, that allows distributed biojava apps using RMI).

absolutely - Java Serialisation is great for what it was intended,
namely the painless, short-term reading/writing of (few) objects from/to
a stream - like you need to do in RMI. It's absolutely no substitute for
a database, though.
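
A minimal sketch of that intended use, writing one object to a stream and reading it back (the class Probe is invented for the example; only the standard java.io classes are used):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class for illustration only.
final class Probe implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;

    Probe(String name) {
        this.name = name;
    }
}

class SerialisationDemo {
    // Writes one object to a byte array and reads a copy back.
    static Object roundTrip(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Probe original = new Probe("p1");
        Probe copy = (Probe) roundTrip(original);
        System.out.println(copy.name.equals(original.name)); // state survives
        System.out.println(copy == original);                // identity does not
    }
}
```

What comes back is a copy with the same state but a different identity, which is fine for RMI and one reason why it does not scale up to a store of interrelated objects.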

> 
> Of course, this isn't a reason for not developing more sophisticated
> persistance mechanisms for the cases where they're more appropriate.
> 
> > You didn't discuss using XML representations of  biojava objects.  That
> > might offer a reasonable way to allow a wide variety of types of user to
> > exploit biojava. Once you have the XML you can do what you like with it...
> 
> XML is probably my preferred method for a lot of long-term/cross-application
> persistence functions.  BioJava is already using XML a little bit:
> take a look at the XmlMarkovModel class.  I expect this will grow,
> but when there isn't an existing XML grammar which fulfils the
> requirements of a particular BioJava object, a bit of care is needed
> to create a new grammar which will be widely accepted.

Regardless of all the niceties of XML, DTDs and handling DOMs from Java,
we have to face the fact that XML is in essence just another way of
defining flat file formats - fancy, easy-to-use file formats, granted,
but flat files nevertheless. As such, an XML representation of an object
graph suffers from many of the same drawbacks that other flat file
representations suffer from (especially in contrast to a database
representation of the same object graph): no datatypes (everything is a
string); no transaction safety (isolation of access); no query
capabilities against the data; ...
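
A small sketch of the "everything is a string" point, using the standard DOM parser. The element and attribute names (sequence, length) are invented for the example and do not come from any existing DTD:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTypesDemo {
    // Returns the raw "length" attribute of the root element.
    static String rawLength(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("length");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = rawLength("<sequence id=\"seq1\" length=\"1200\"/>");
        int length = Integer.parseInt(raw); // the application has to convert by hand
        System.out.println(length + 1);
    }
}
```

The parser hands back the attribute as a String; turning it into a number (and deciding what to do if it isn't one) is left to the application.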

Additionally, XML representations tend to be verbose - so you need to
compress on the fly.

All this makes XML IMHO a very nice vehicle for the transient, portable,
platform-neutral representation of data (e.g. for database
import/export) but makes the idea of building a datastore of objects in
XML not really much more attractive than it would be in any other flat
file format.

Oh yes: and just as it is tedious to convert Java objects to/from a
relational database representation, it is tedious to convert Java
objects to/from an XML representation...

On the other hand - the world would be a better place if we just had
GenBank in an XML representation based on a good DTD (-:

	all the best,
	gerald

> 
> Thomas.
> --
> There are whose study is of smells
> And to attentive schools rehearse
> How something mixed with something else
> Makes something worse.

-- 
   Gerald.Loeffler@vienna.at _________________ Software Architect
   http://www.imp.univie.ac.at ____ http://www.daemonstration.com 
   OOA&D, Java, J2EE, JSP, Servlets, JavaBeans, ODBMS, RDBMS, XML



--__--__--

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org 
http://biojava.org/mailman/listinfo/biojava-l 


End of Biojava-l Digest