[MOBY-dev] data by reference - a request for comments

Wed Jul 30 13:51:54 UTC 2008

Hi Martin,

Had to think about the issues you raised below for a while...

On 28 Jul 2008, at 03:29, Martin Senger wrote:

> I am trying to figure out how I would implement the data by reference
> in order to achieve the main purpose - not to have all data in memory.
> I know how I would do it for the references being on the primitive
> types level, but it is less clear to me how I would do it with the
> references on other levels. Perhaps you can help me to explain how you
> would do it (or even Pieter can explore how he is already doing it)?

What I use mostly are http(s) references. I download them and save  
them as tmp files in a tmp dir. What happens next completely depends  
on the service.
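
For those not living in a Perl world, here is a minimal Python sketch of that first step, assuming nothing more than a plain http(s) URL. The function name fetch_reference and the "moby_ref_" file prefix are made up for illustration; this is not code from any BioMoby library.

```python
import shutil
import tempfile
import urllib.request


def fetch_reference(url, tmp_dir=None):
    """Stream the data behind a reference to a tmp file and return its path.

    Hypothetical helper: the caller decides what to do with the file
    and is responsible for deleting it afterwards.
    """
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(
            dir=tmp_dir, prefix="moby_ref_", suffix=".tmp", delete=False
        ) as tmp:
            # copyfileobj copies in fixed-size blocks, so the payload is
            # never held in memory as a whole.
            shutil.copyfileobj(response, tmp)
            return tmp.name
```

The service then gets a file path instead of the data itself and can choose a DOM, hybrid or streaming approach as it sees fit.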

Sometimes I'm lazy and simply use a DOM parser to load the entire  
thing in memory for convenient access to all nodes. (Note that I'm  
living in a Perl world and therefore this makes sense. Using a  
reference prevents DOM parsing of the SOAP message by SOAP::Lite which  
has an expansion factor 5-7 times worse than DOM parsing of only the  
Moby payload.)

If I cannot afford to be lazy, because the Moby payload tends to be  
big, I use hybrid parsers. I never use pure streaming (SAX) parsers.  
Hybrid means I use a streaming parser to divide the input into certain  
chunks/fragments. Next I loop over the chunks and load them completely  
into memory. This way I only have one chunk at a time in memory, plus
a few globals such as a counter to keep track of the number of BLAST
hits. This hybrid parsing works both for XML and for legacy data
formats like BLAST reports in tabular format.

In one case I use an XSLT to parse and convert one type of Moby XML
message into an SVG image (which is also XML). The XSLT is pretty cool
and is also some kind of hybrid. Basically it works much like a flat
file indexing system. You define certain nodes of interest. It parses
the XML (SAX not DOM) once to make an index of where these nodes occur
in the XML. This index is kept in memory, but it is a lot smaller than
the data, since it stores only where a certain node occurs, not its
contents, and only for the nodes you specified, not for everything.
Next it parses the XML again (SAX not DOM) to do your business logic.

The inconvenience of a pure streaming parser is that you cannot jump
around the data. Once the streaming parser has passed a certain point,
it forgets about that data unless you decide to store it in variables.
So, if you parse 100 BLAST hits and at number 88 you want to go back
to compare it to number 12, you cannot unless you stored number 12
somehow. With the XSLT you don't have to store it, as you could have
indexed the BLAST hits. So, when you need number 12 again, the parser
jumps back to that point and parses only the chunk/fragment for number
12 again. This means some parts of the XML might get parsed more than
once, but for scalability that is much better than parsing the entire
XML several times or storing large pieces in memory.
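
To illustrate the index-and-jump-back idea outside of XSLT, here is a small Python sketch that uses plain byte offsets into a tabular BLAST file. This is a swapped-in flat-file technique, not how the XSLT itself works. It assumes the query id is the first tab-separated column and that the rows of one query are contiguous with no comment lines in between. build_index and read_chunk are hypothetical names.

```python
def build_index(path):
    """First pass: map each query id to the byte offset of its first row."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # position before reading the line
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"#") or not line.strip():
                continue
            query_id = line.split(b"\t", 1)[0].decode()
            # Keep only the first offset seen for each query.
            index.setdefault(query_id, offset)
    return index


def read_chunk(path, index, query_id):
    """Jump to one query and parse only that fragment of the file."""
    rows = []
    with open(path, "rb") as handle:
        handle.seek(index[query_id])
        for line in handle:
            fields = line.decode().rstrip("\r\n").split("\t")
            if fields[0] != query_id:
                break  # reached the next query (or a comment line)
            rows.append(fields)
    return rows
```

The index holds one integer per query, so going back from hit 88 to hit 12 costs one seek and a re-parse of that fragment only.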

The disadvantage of XSLTs, in my opinion, is that they work radically
differently from any language I've seen before. Perl and Java are
different, but they share many concepts like if, else, while and for
loops. Developing XSLTs requires a completely different way of
thinking and I never found it easy.

> For references on the primitive type level, I would do a usual parsing
> of a Moby message and when I find a reference, I would resolve it to a
> local file and I would pass to my service class the data as a file
> reference. It would be up to the service to read the file contents to
> the memory or not - that would depend what it needs to do with the
> data.
>
> However, on the higher level, such as mobyData level, the reference
> must be treated differently. If I do the same (resolving the reference
> to a local file), I still need to call a parser again to parse the
> contents of the local file. Because I still want to give my service
> already parsed data (and not a biomoby XML), and because I do not want
> to have all in memory, I need to create local files for each primitive
> type and do the same as above (passing my service local file
> references). I do not see many other choices.

Not only would you have to do that for the primitives. If you want to
provide pre-parsed data to a service, you would have to do it for the
values of namespace and id attributes at any level of mobyData too,
right? I assume we are talking about MoSeS here, where code is
generated, based on the data in BioMoby Central, both to parse the
BioMoby XML inputs of a service and to compile its results as BioMoby
XML. Saving everything pre-parsed as tmp files to disk doesn't make
much sense to me. If you did it both for the entire message and for
all its possible dissected parts, it would cause quite some redundancy
and overhead.

But how to do it with a hybrid DOM/SAX parser or with an XSLT that
uses indices? You would have to generate the code to chunk the input
for a hybrid parser, or generate the code to index certain nodes for
the XSLT. This means you need to know how the business logic of the
service works to figure out at which level of the XML you would have
to chunk or index. I know Mark has this famous slide with the God of
BioMoby, but I'm afraid even the almighty Martin cannot predict, based
on the info in BioMoby Central, how the business logic of a service
works :(... In the current MoSeS, parsing of the input and the
business logic are two separate things, but if you want to improve
scalability with a hybrid SAX/DOM or XSLT parser, parsing the input
will become an essential part of the business logic.

What you could do is generate some disabled example code to
chunk/index at any level, and then a developer could uncomment the
lines required. Furthermore, there are some obvious levels at which
chunking/indexing makes sense. Take for example a huge Collection of
Simples. It would make sense to chunk/index at the level of the
pseudo-root element of the Simples. I don't think it's possible to
pre-generate all code to parse the input in the right order and with
chunking/indexing at the right levels, but with a few smart examples a
developer should be able to quickly change the order and/or adapt to
chunking/indexing further up or down the tree.
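
As an idea of what such a generated example could look like for the Collection-of-Simples case, here is a Python sketch using the standard library's incremental XML parser. The namespace URI in MOBY_NS is my assumption of what the messages use, and iter_simples is a made-up name; this is not MoSeS output.

```python
import xml.etree.ElementTree as ET

MOBY_NS = "http://www.biomoby.org/moby"  # assumed namespace URI
SIMPLE_TAG = "{%s}Simple" % MOBY_NS


def iter_simples(source):
    """Yield each Simple as a small in-memory tree, then throw it away.

    `source` is a file name or a binary file object. To chunk further
    up or down the tree, change SIMPLE_TAG to another element name.
    """
    open_elements = []  # stack of elements that are still open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        open_elements.pop()  # elem has just been closed
        if elem.tag == SIMPLE_TAG:
            yield elem  # the caller gets DOM-like access to one chunk
            if open_elements:
                # Detach the processed chunk so the Collection does not
                # grow in memory while the rest is parsed.
                open_elements[-1].remove(elem)
```

The business logic sits in the loop that consumes iter_simples, which is exactly the point above: once you chunk, the parsing level and the business logic are no longer independent.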

> Am I missing a point here?

No, making a typical "currency converter" example web service with a
single small input and a single small output is easy, but making a web
service that scales well is really hard!

There is a Perl module in the BioMoby CVS for XSLT based parsing. It  
was developed in France some time ago if I remember correctly and  
AFAIK it is not maintained, so it probably lacks some of the newer  
stuff like proper error handling, async services etc., but it might be  
a nice example of what can be done with XSLTs.

Ok, those of you who managed to read all the way down to here have
earned a beer at the next BioMoby developers meeting!

Cheers,

Pi

> The above is doable, of course. But you see why I wanted to have
> references only on the primitive type level. If we have them on any
> level (as it seems to be) I need, in my implementation, to actually
> make other (local) references for the primitive types anyway. Of
> course, it is better than the remote references because I do not need
> to make a network connection for each individual primitive type (point
> taken, Pieter) - but it still has to be done.
>
> Cheers,
> Martin
>
> -- 
> Martin Senger
> email: martin.senger at gmail.com,m.senger at cgiar.org
> skype: martinsenger

-------------------------------------------------------------
Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034

Dreijenlaan 3
6703 HA Wageningen
The Netherlands

phone:  +31 (0)317-483 060
mobile: +31 (0)6-143 66 783
e-mail: pieter.neerincx at gmail.com
skype:  pieter.online
-------------------------------------------------------------