[Biojava-l] BioJava discussion board

Thomas Down td2@sanger.ac.uk
Wed, 28 Aug 2002 10:55:42 +0100


Hi Brian

On Wed, Aug 28, 2002 at 12:09:05AM -0400, Brian Gilman wrote:
>
> 	BioJava does not work well in a distributed environment in terms
> of RMI calls or in the "web services" stack. Custom
> serializers/deserializers need to be made for each and every object that
> exists in the feature hierarchy. This is painful to say the least. T
> 
> 	Where's the constructor!! There are a lot of factories that,
> while making client side programming very easy to do, kill a middleware
> guy like myself. 

I think you're right about this being the main point of impedance
mismatch between BioJava and Web Services (or other distributed
object systems, for that matter).

What I'm not so certain about is how easily it can be fixed.  There's
certainly more to this than just adding constructors.  The basic
web-services serialization system works well for objects which fit
closely with the JavaBeans model.  For instance:

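    public class Employee {
       private String name;
       private OrganizationalUnit department;

       public String getName() { return name; }
       public void setName(String newName) { this.name = newName; }
       public OrganizationalUnit getDepartment() { return department; }
       public void setDepartment(OrganizationalUnit newDepartment) { this.department = newDepartment; }
    }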

In fact, that kind of example seems to be precisely what a lot
of the developers of web-services had in mind.


I'm going to concentrate on the BioJava Sequence interface and
related stuff, since that's the bit most people are familiar
with, and it's also one of the most problematic parts from
a distribution point of view.

Simply adding a constructor and some javabeany mutator methods
to SimpleSequence won't fix anything -- your SOAP toolkit
(or whatever) still won't understand how to get at the Symbol
or Feature objects (since these need to be iterated).  And even
if it could access the Symbols as an array (or whatever), the
default SOAP-ENC serialization of these will be quite hideously
inefficient.

To make a Sequence object which genuinely plays nicely with
SOAP (and other distributed object and persistence technologies)
you're going to end up with something looking like:

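   public class Sequence {
     private String seqString;
     private String name;
     private Feature[] features;

     // No-argument constructor, as the JavaBeans model expects.
     public Sequence() { }
     public String getSeqString() { return seqString; }
     public void setSeqString(String seq) { this.seqString = seq; }
     public String getName() { return name; }
     public void setName(String name) { this.name = name; }
     public Feature[] getFeatures() { return features; }
     public void setFeatures(Feature[] features) { this.features = features; }
   }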

This will, of course, SOAP-ENC trivially.  But whether anyone would
really like to program with this is another matter -- I for one
would prefer something that looks more like the current BioJava
interface.

In the `data blob' world, it's also far harder to impose conditions
like "Features must fit onto the Sequence to which they're attached".
A lot of the factory-patterns in BioJava are there principally to
ensure data integrity.
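
As a rough illustration of that integrity point, here is a minimal
sketch.  It assumes a Java version with generics, and every name in it
(IntegrityDemo, BlobSequence, BlobFeature, CheckedSequence) is made up
for the example rather than taken from BioJava.  The blob classes
accept any feature; the factory-style class checks coordinates before
a feature can exist at all.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are hypothetical; none of this is real BioJava API.
class IntegrityDemo {

    /** Data-blob feature: public setters, no checks. */
    static class BlobFeature {
        private int start;
        private int end;
        public int getStart() { return start; }
        public void setStart(int start) { this.start = start; }
        public int getEnd() { return end; }
        public void setEnd(int end) { this.end = end; }
    }

    /** Data-blob sequence: accepts whatever features it is given. */
    static class BlobSequence {
        private String seqString = "";
        private BlobFeature[] features = new BlobFeature[0];
        public String getSeqString() { return seqString; }
        public void setSeqString(String seqString) { this.seqString = seqString; }
        public BlobFeature[] getFeatures() { return features; }
        public void setFeatures(BlobFeature[] features) { this.features = features; }
    }

    /** Factory style: features can only be made by the sequence that owns them. */
    static final class CheckedSequence {
        static final class Feature {
            private final int start;
            private final int end;
            private Feature(int start, int end) { this.start = start; this.end = end; }
            int getStart() { return start; }
            int getEnd() { return end; }
        }

        private final String seqString;
        private final List<Feature> features = new ArrayList<>();

        CheckedSequence(String seqString) { this.seqString = seqString; }

        /** Coordinates are 1-based and inclusive; off-sequence features are rejected. */
        Feature createFeature(int start, int end) {
            if (start < 1 || end < start || end > seqString.length()) {
                throw new IllegalArgumentException(
                    "Feature " + start + ".." + end
                    + " does not fit on a sequence of length " + seqString.length());
            }
            Feature f = new Feature(start, end);
            features.add(f);
            return f;
        }

        int countFeatures() { return features.size(); }
    }

    public static void main(String[] args) {
        BlobSequence blob = new BlobSequence();
        blob.setSeqString("ACGT");
        BlobFeature bad = new BlobFeature();
        bad.setStart(10);
        bad.setEnd(20);
        blob.setFeatures(new BlobFeature[] { bad }); // accepted without complaint

        CheckedSequence checked = new CheckedSequence("ACGT");
        checked.createFeature(1, 4); // fits, so it is accepted
        try {
            checked.createFeature(10, 20); // runs off the end of the sequence
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A blob could of course validate inside setFeatures(), but then the
check depends on the order in which a deserializer happens to call the
setters, which is part of why the factory approach is attractive.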

The final issue is that, arguably, a lot of serialization belongs
on interfaces rather than classes.  Suppose I get a sequence from
the biojava-ensembl package.  It'll be an instance of the
class EnsemblContigSequence, which isn't even a public class, let
alone has a public constructor.  Attached, it has lots of ensembl-specific
Feature implementations (which are also package-private, of course).
Behind the scenes things are even worse from a serialization point
of view -- lots of lazy fetching, and data caches which are maintained
by the containing EnsemblSequenceDB object rather than the Sequence
itself.

If I pass this object into a serializer, what I want to come out
probably isn't a detailed description of the guts of that particular
EnsemblContigSequence object -- the client machine might not even
have biojava-ensembl at all.  I'd rather just serialize everything
in a generic way, and re-create everything on the client as a
SimpleSequence (or whatever).  Does this make sense?
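
To show roughly what I mean, here is a minimal sketch of serializing
through the interface only.  Everything in it is a hypothetical,
much-reduced stand-in: Seq plays the role of the Sequence interface,
LazyDbSeq stands in for a database-backed implementation, PlainSeq for
SimpleSequence, and the toy name-plus-residues wire format is just an
assumption to keep the example short.

```java
// All names here are hypothetical stand-ins, not real BioJava API.
class WireDemo {

    /** The only contract the serializer knows about. */
    interface Seq {
        String getName();
        String seqString();
    }

    /** Plain in-memory implementation that any client could have. */
    static final class PlainSeq implements Seq {
        private final String name;
        private final String residues;
        PlainSeq(String name, String residues) {
            this.name = name;
            this.residues = residues;
        }
        public String getName() { return name; }
        public String seqString() { return residues; }
    }

    /** Stand-in for a database-backed sequence with lazy fetching. */
    static final class LazyDbSeq implements Seq {
        private final String name;
        private String cached; // filled in on first access
        LazyDbSeq(String name) { this.name = name; }
        public String getName() { return name; }
        public String seqString() {
            if (cached == null) {
                cached = "GATTACA"; // pretend this came from a database
            }
            return cached;
        }
    }

    /** Writes only what the interface exposes; reads back a PlainSeq. */
    static final class SeqWire {
        // Toy format: name, newline, residues. Assumes the name has no newline.
        static String write(Seq s) {
            return s.getName() + "\n" + s.seqString();
        }

        static Seq read(String wire) {
            int nl = wire.indexOf('\n');
            if (nl < 0) {
                throw new IllegalArgumentException("malformed record");
            }
            return new PlainSeq(wire.substring(0, nl), wire.substring(nl + 1));
        }
    }

    public static void main(String[] args) {
        Seq original = new LazyDbSeq("contig1");
        Seq copy = SeqWire.read(SeqWire.write(original));
        // The copy is a PlainSeq; nothing about LazyDbSeq crossed the wire.
        System.out.println(copy.getClass().getSimpleName()
            + " " + copy.getName() + " " + copy.seqString());
    }
}
```

The receiving side never needs the LazyDbSeq class on its classpath,
which is the property I'm after.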




So what solutions do we have?

   1. Come up with an `over-the-wire' API which is based on
      data-blobs (like the example Sequence class, above) rather
      than complex interfaces and factories.  It'll be easy to
      bridge from this back to something more like the current
      BioJava client API, which remains a client-oriented API.

   2. Bite the bullet and write custom serializers/deserializers
      to transform between BioJava and some reasonably neutral
      XML representations (which could be shared with Omnigene
      and other projects).  I know doing this sucks, but at the
      end of the day, there aren't /that/ many data types which
      need to be shuffled around.  It might be the easiest option
      to get some really compelling web services up and working
      with BioJava.

   3. Come up with some scheme of metadata which allows the semantics
      of BioJava (and other) interfaces to be defined in enough
      detail that they can be serialized and deserialized automatically.
      This is really quite similar to option 2, except with a different
      language.  I'd guess this is the hardest option, but might, in
      the future, also be useful for other things -- e.g. auto-generating
      database adaptors.
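
For option 1, the bridge back to a client-style API might look
something like the sketch below.  The names (BridgeDemo, WireSequence,
SymbolListView, WireSequenceBridge) are hypothetical, and the view
interface is far smaller than anything in BioJava; the 1-based
symbolAt() is only there to show that the bridge can add behaviour the
blob doesn't have.

```java
// All names here are hypothetical; a reduced sketch, not the BioJava API.
class BridgeDemo {

    /** Over-the-wire blob: no-arg constructor plus getters and setters. */
    static class WireSequence {
        private String name;
        private String seqString;
        public WireSequence() { }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getSeqString() { return seqString; }
        public void setSeqString(String seqString) { this.seqString = seqString; }
    }

    /** Read-only, client-oriented view of a sequence. */
    interface SymbolListView {
        String getName();
        int length();
        char symbolAt(int index); // 1-based
    }

    /** Wraps a blob once it has arrived and exposes it through the view. */
    static final class WireSequenceBridge implements SymbolListView {
        private final String name;
        private final String residues;

        WireSequenceBridge(WireSequence wire) {
            if (wire.getName() == null || wire.getSeqString() == null) {
                throw new IllegalArgumentException("incomplete blob");
            }
            // Copy the values so later setter calls on the blob don't leak through.
            this.name = wire.getName();
            this.residues = wire.getSeqString();
        }

        public String getName() { return name; }

        public int length() { return residues.length(); }

        public char symbolAt(int index) {
            if (index < 1 || index > residues.length()) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return residues.charAt(index - 1);
        }
    }

    public static void main(String[] args) {
        WireSequence wire = new WireSequence();
        wire.setName("seq1");
        wire.setSeqString("ACGT");
        SymbolListView view = new WireSequenceBridge(wire);
        System.out.println(view.getName() + " has length " + view.length());
    }
}
```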


And I guess the final alternative...

    4. Take a completely different approach, declare the `interface
       oriented' BioJava 1 an interesting experiment, and design
       BioJava 2 in a more `data-blob' fashion.



Personally, I think (4) would be a great shame.  But it undoubtedly
would make supporting distributed systems, and using fully-automatic
object persistence solutions, a whole lot easier.  So I guess it's
something we should discuss.

     Thomas.



PS. I think there's a lot in common between the issues discussed here,
    and the question of UML class diagrams which came up on the
    discussion board yesterday.  UML is also more comfortable with
    the `data blob' way of doing things.  When I wrote the two example
    classes in this message, I realized that it would have been
    easier to write them in UML than Java.  This is not true of most
    of the BioJava interfaces.