[DAS] SOAP/XML: best way to marshall large objects

Thomas Down td2@sanger.ac.uk
Tue, 8 Jan 2002 16:26:48 +0000


On Tue, Jan 08, 2002 at 09:32:12AM -0500, Lincoln Stein wrote:
> 
> I have just spent some time with the Perl SOAPLite modules, and am wondering 
> how best to marshall large data structures in SOAP.  The case of concern is 
> the DAS features request, which in SOAPish form becomes the 
> getSequenceFeature service.

Hmmm, I presume you're really talking about the issue of
un-marshalling the data structures when they reach the client?

> This request returns a large number of features, typically thousands.  In the 
> Perl SOAP API, there are two alternative ways to handle this.  One processes 
> the entire response in one big swallow and returns an array of objects. 

Does this mean processing everything into SeqFeature-type
structures and returning them, or building a DOM-tree, then
allowing the application to walk over that?

My experience with DOM trees (in the first iteration of the BioJava
DAS client, and in various other situations) has been very, very
bad -- they are expensive to build and tear down, and take far more
memory than either the XML text or the `real' data structures they
represent.

If things are getting parsed into more optimal data structures,
then this option probably isn't too bad.
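
For what it's worth, here's a rough Java sketch of the kind of
thing I mean.  The element and attribute names (FEATURE, type,
start, stop) are stand-ins for whatever the real encoding uses, and
SimpleFeature is a made-up class -- the point is just that a SAX
handler can go straight from the stream to lean objects, without
ever materialising a tree:

    import java.util.ArrayList;
    import java.util.List;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal feature record -- only the fields we actually need.
    class SimpleFeature {
        final String type;
        final int start, stop;
        SimpleFeature(String type, int start, int stop) {
            this.type = type; this.start = start; this.stop = stop;
        }
    }

    // SAX handler which builds SimpleFeatures as elements arrive,
    // so the per-feature cost is one small object, not a subtree.
    class FeatureHandler extends DefaultHandler {
        final List features = new ArrayList();

        public void startElement(String uri, String local,
                                 String qName, Attributes atts) {
            if ("FEATURE".equals(qName)) {   // assumed element name
                features.add(new SimpleFeature(
                        atts.getValue("type"),
                        Integer.parseInt(atts.getValue("start")),
                        Integer.parseInt(atts.getValue("stop"))));
            }
        }
    }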

> The 
> other creates a getNextSequenceFeature iterator and calls the iterator 
> repeatedly to fetch each feature.

Do you mean creating an iterator on the server side and then making
a SOAP transaction to fetch each feature?  I think this is a bad
idea -- network latency can add up very quickly.  Part of the
reason I like SOAP vs. CORBA is that SOAP is (arguably) better
at moving large amounts of complex data over the network in a
single operation.
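
(Back-of-envelope, with made-up numbers: at 50ms round-trip time
per call, 5,000 features cost 250 seconds in latency alone, however
fast the network is.  A single call returning all 5,000 pays that
50ms exactly once.)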

[One other possibility -- you could write iterators which return
blocks, rather than single features.  This is what BioCORBA does.
I'm not convinced this is a great idea (at least not in the SOAP
world), but it's something to think about.]
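
If you did go down that road, the interface might look something
like this (names entirely hypothetical, reusing the SimpleFeature
class from the sketch above) -- the point being that one SOAP
round trip amortises its latency over a whole block:

    // Each nextBlock() call is one SOAP round trip returning up
    // to blockSize features; a zero-length array means exhausted.
    interface FeatureBlockIterator {
        SimpleFeature[] nextBlock(int blockSize);
    }

    class BlockConsumer {
        // Latency is paid once per block of 500, not per feature.
        static void consumeAll(FeatureBlockIterator it) {
            SimpleFeature[] block;
            while ((block = it.nextBlock(500)).length > 0) {
                for (int i = 0; i < block.length; i++) {
                    // ... hand block[i] to the application ...
                }
            }
        }
    }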

One extra issue with iterators: they make your servers stateful --
something you should think quite carefully about before
committing yourself.

> Neither method is particularly optimal.  The first wastes a lot of time 
> parsing the XML stream and eating up memory before it returns a result that 
> the user code can work with.  The second makes a network method call for each 
> feature, killing the transaction with latency delays.
> 
> Of course, I can also get the XML stream from the SOAP request, pass it to a 
> parser, and invoke callbacks, and perhaps this is the best way to do it.  But 
> I wonder what experience and advice people who are experienced with the Java 
> APIs have on this before I jump to the wrong conclusions.

That's actually what's going on under the hood in my code.
But in practice, once people are using the BioJava interfaces,
the threads which originally requested the data just end up
blocking until it's all been parsed, since most code wants to
be able to access the complete set of features it's requested.
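
Concretely (and with the caveat that FeatureHandler above is a
toy, and that how you get the raw response stream out of your SOAP
toolkit varies from toolkit to toolkit), the plumbing amounts to:

    import java.io.InputStream;
    import java.util.List;
    import javax.xml.parsers.SAXParserFactory;

    class StreamingClient {
        // Feed the XML of the SOAP response straight into the SAX
        // handler.  The caller blocks here until parsing finishes,
        // then gets the complete list of lean feature objects.
        static List fetchFeatures(InputStream soapResponse)
                throws Exception {
            FeatureHandler handler = new FeatureHandler();
            SAXParserFactory.newInstance().newSAXParser()
                            .parse(soapResponse, handler);
            return handler.features;
        }
    }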

The main reason I'm a fan of a streaming approach is that
it gets the data as quickly as possible into dedicated data
structures, rather than having to hold it in memory as XML
text (moderately bloated) or DOM trees (huge).  The time spent
parsing the data is generally going to be small compared to the
time spent pushing it over the network.

Hope this gives you a useful datapoint, anyway.

   Thomas.