[Biojava-l] BioJava discussion board
Thomas Down
td2@sanger.ac.uk
Wed, 28 Aug 2002 10:55:42 +0100
Hi Brian
On Wed, Aug 28, 2002 at 12:09:05AM -0400, Brian Gilman wrote:
>
> BioJava does not work well in a distributed environment in terms
> of RMI calls or in the "web services" stack. Custom
> serializers/deserializers need to be made for each and every object that
> exists in the feature hierarchy. This is painful to say the least.
>
> Where's the constructor!! There are a lot of factories that,
> while making client-side programming very easy to do, kill a middleware
> guy like myself.
I think you're right about this being the main point of impedance
mismatch between BioJava and Web Services (or other distributed
object systems, for that matter).
What I'm not so certain about is how easily it can be fixed. There's
certainly more to this than just adding constructors. The basic
web-services serialization system works well for objects which fit
closely with the JavaBeans model. For instance:
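    public class Employee {
        private String name;
        private OrganizationalUnit department;

        public String getName() { return name; }
        public void setName(String newName) { name = newName; }
        public OrganizationalUnit getDepartment() { return department; }
        public void setDepartment(OrganizationalUnit newDepartment) { department = newDepartment; }
    }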
In fact, that kind of example seems to be precisely what a lot
of the developers of web-services had in mind.
I'm going to concentrate on the BioJava Sequence interface and
related stuff, since that's the bit most people are familiar
with, and it's also one of the most problematic parts from
a distribution point of view.
Simply adding a constructor and some javabeany mutator methods
to SimpleSequence won't fix anything -- your SOAP toolkit
(or whatever) still won't understand how to get at the Symbol
or Feature objects (since these need to be iterated). And even
if it could access the Symbols as an array (or whatever), the
default SOAP-ENC serialization of these will be quite hideously
inefficient.
To make a Sequence object which genuinely plays nicely with
SOAP (and other distributed object and persistence technologies)
you're going to end up with something looking like:
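    public class Sequence {
        private String seqString;
        private String name;
        private Feature[] features;

        public Sequence() {}
        public String getSeqString() { return seqString; }
        public void setSeqString(String seq) { seqString = seq; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Feature[] getFeatures() { return features; }
        public void setFeatures(Feature[] features) { this.features = features; }
    }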
This will, of course, SOAP-ENC trivially. But whether anyone would
really like to program with this is another matter -- I for one
would prefer something that looks more like the current BioJava
interface.
In the `data blob' world, it's also far harder to impose conditions
like "Features must fit onto the Sequence to which they're attached".
A lot of the factory-patterns in BioJava are there principally to
ensure data integrity.
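
To make the integrity point concrete, here is a minimal sketch. It is
not the BioJava API: CheckedSequence, FeatureBlob and createFeature are
names made up for this message, and the 1-based inclusive coordinates
are an assumption. The idea is that a factory-style method is the only
way to attach a feature, so the bounds check can't be skipped, whereas
a public setFeatures(Feature[]) has no natural place for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; these are not BioJava classes.
final class FeatureBlob {
    final String type;
    final int start; // assumed 1-based, inclusive
    final int end;   // assumed 1-based, inclusive

    FeatureBlob(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class CheckedSequence {
    private final String seqString;
    private final List<FeatureBlob> features = new ArrayList<>();

    CheckedSequence(String seqString) {
        this.seqString = seqString;
    }

    // Factory-style method: the only way to attach a feature to this
    // sequence, so the bounds check below cannot be bypassed.
    FeatureBlob createFeature(String type, int start, int end) {
        if (start < 1 || end < start || end > seqString.length()) {
            throw new IllegalArgumentException(
                "Feature " + start + ".." + end
                + " does not fit on a sequence of length " + seqString.length());
        }
        FeatureBlob feature = new FeatureBlob(type, start, end);
        features.add(feature);
        return feature;
    }

    int countFeatures() {
        return features.size();
    }
}
```

With this, createFeature("exon", 5, 20) on an 8-symbol sequence throws
instead of silently producing an inconsistent object.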
The final issue is that, arguably, a lot of serialization belongs
on interfaces rather than classes. Suppose I get a sequence from
the biojava-ensembl package. It'll be an instance of the
class EnsemblContigSequence, which isn't even a public class, let
alone has a public constructor. Attached, it has lots of ensembl-specific
Feature implementations (which are also package-private, of course).
Behind the scenes things are even worse from a serialization point
of view -- lots of lazy fetching, and data caches which are maintained
by the containing EnsemblSequenceDB object rather than the Sequence
itself.
If I pass this object into a serializer, what I want to come out
probably isn't a detailed description of the guts of that particular
EnsemblContigSequence object -- the client machine might not even
have biojava-ensembl at all. I'd rather just serialize everything
in a generic way, and re-create everything on the client as a
SimpleSequence (or whatever). Does this make sense?
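
Roughly what I have in mind, as a sketch only. SeqView, LazySeq,
PlainSeq and WireFormat are all invented for this example (they stand
in for the Sequence interface, an Ensembl-style lazy implementation,
SimpleSequence and a serializer respectively), and the tab-separated
line is just a placeholder for a real wire format. The writer only
ever talks to the interface, so the concrete class never leaks across
the wire.

```java
// Hypothetical, cut-down stand-in for a sequence interface; not the BioJava API.
interface SeqView {
    String getName();
    String seqString();
}

// Stand-in for a non-public implementation that fetches its data lazily.
final class LazySeq implements SeqView {
    private final String name;
    private String cached; // filled in on first access

    LazySeq(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public String seqString() {
        if (cached == null) {
            cached = "ACGT"; // pretend this was fetched from a database
        }
        return cached;
    }
}

// Plain implementation that any client can construct.
final class PlainSeq implements SeqView {
    private final String name;
    private final String seq;

    PlainSeq(String name, String seq) {
        this.name = name;
        this.seq = seq;
    }

    public String getName() {
        return name;
    }

    public String seqString() {
        return seq;
    }
}

final class WireFormat {
    // Serialize through the interface only, ignoring the concrete class.
    // Assumes the name contains no tab characters.
    static String write(SeqView s) {
        return s.getName() + "\t" + s.seqString();
    }

    // Re-create the sequence on the client as the plain implementation.
    static SeqView read(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab separator");
        }
        return new PlainSeq(line.substring(0, tab), line.substring(tab + 1));
    }
}
```

The client ends up with a PlainSeq carrying the same name and symbols,
and never needs the class that produced them.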
So what solutions do we have?
1. Come up with an `over-the-wire' API which is based on
data-blobs (like the example Sequence class, above) rather
than complex interfaces and factories. It'll be easy to
bridge from this back to something more like the current
BioJava client API, which remains a client-oriented API.
2. Bite the bullet and write custom serializers/deserializers
to transform between BioJava and some reasonably neutral
XML representations (which could be shared with Omnigene
and other projects). I know doing this sucks, but at the
end of the day, there aren't /that/ many data types which
need to be shuffled around. It might be the easiest option
to get some really compelling web services up and working
with BioJava.
3. Come up with some scheme of metadata which allows the semantics
of BioJava (and other) interfaces to be defined in enough
detail that they can be serialized and deserialized automatically.
This is really quite similar to option 2, except with a different
language. I'd guess this is the hardest option, but might, in
the future, also be useful for other things -- e.g. auto-generating
database adaptors.
And I guess the final alternative...
4. Take a completely different approach, declare the `interface
oriented' BioJava 1 an interesting experiment, and design
BioJava 2 in a more `data-blob' fashion.
Personally, I think (4) would be a great shame. But it undoubtedly
would make supporting distributed systems, and using fully-automatic
object persistence solutions, a whole lot easier. So I guess it's
something we should discuss.
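
For what it's worth, the bridge mentioned under option 1 could be as
small as the sketch below. SequenceBean and BeanBackedSequence are
hypothetical names, the bean is cut down to two properties, and the
1-based indexing is my assumption. The bean is what travels over the
wire; the wrapper validates it once and gives the client a read-only
view.

```java
// Hypothetical wire-level bean, in the style of a JavaBeans data blob.
final class SequenceBean {
    private String name;
    private String seqString;

    public SequenceBean() {}
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getSeqString() { return seqString; }
    public void setSeqString(String seq) { this.seqString = seq; }
}

// Hypothetical client-side view: read-only, validated once on construction.
final class BeanBackedSequence {
    private final String name;
    private final String symbols;

    BeanBackedSequence(SequenceBean bean) {
        if (bean.getName() == null || bean.getSeqString() == null) {
            throw new IllegalArgumentException("bean is missing name or sequence");
        }
        // Copy the values so later changes to the bean cannot affect this view.
        this.name = bean.getName();
        this.symbols = bean.getSeqString();
    }

    String getName() {
        return name;
    }

    int length() {
        return symbols.length();
    }

    // Indexing is assumed to be 1-based for this sketch.
    char symbolAt(int index) {
        if (index < 1 || index > symbols.length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return symbols.charAt(index - 1);
    }
}
```

The toolkit only ever sees SequenceBean; client code only ever sees
the wrapper.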
Thomas.
PS. I think there's a lot in common between the issues discussed here,
and the question of UML class diagrams which came up on the
discussion board yesterday. UML is also more comfortable with
the `data blob' way of doing things. When I wrote the two example
classes in this message, I realized that it would have been
easier to write them in UML than Java. This is not true of most
of the BioJava interfaces.