[MOBY-dev] data by reference - a request for comments
Pieter Neerincx
pieter.neerincx at gmail.com
Wed Jul 30 13:51:54 UTC 2008
Hi Martin,
Had to think about the issues you raised below for a while...
On 28 Jul 2008, at 03:29, Martin Senger wrote:
> I am trying to figure out how I would implement the data by reference
> in order to achieve the main purpose - not to have all data in memory.
> I know how I would do it for the references being on the primitive
> types level, but it is less clear to me how I would do it with the
> references on other levels. Perhaps you can help me by explaining how
> you would do it (or even Pieter can explore how he is already doing
> it)?
What I use mostly are http(s) references. I download them and save
them as tmp files in a tmp dir. What happens next completely depends
on the service.
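
A minimal sketch of that download step, written in Python rather than
Perl to keep it short. The function name fetch_to_tmp, the ".xml"
suffix and the chunk size are made up for illustration; the only point
is that the payload is copied to disk block by block and never held in
memory as a whole.

```python
import shutil
import tempfile
import urllib.request


def fetch_to_tmp(url, tmp_dir=None, chunk_size=64 * 1024):
    """Download `url` into a temporary file and return that file's path.

    Hypothetical helper: the data is streamed to disk in blocks of
    `chunk_size` bytes, so memory use does not grow with the payload.
    """
    with urllib.request.urlopen(url) as response, \
            tempfile.NamedTemporaryFile(dir=tmp_dir, suffix=".xml",
                                        delete=False) as tmp:
        shutil.copyfileobj(response, tmp, chunk_size)
        return tmp.name
```

The caller gets a path back and decides for itself whether to read the
file in one go or to stream it; it is also responsible for deleting
the file afterwards.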
Sometimes I'm lazy and simply use a DOM parser to load the entire
thing in memory for convenient access to all nodes. (Note that I'm
living in a Perl world and therefore this makes sense. Using a
reference prevents DOM parsing of the SOAP message by SOAP::Lite which
has an expansion factor 5-7 times worse than DOM parsing of only the
Moby payload.)
If I cannot afford to be lazy, because the Moby payload tends to be
big, I use hybrid parsers. I never use pure streaming (SAX) parsers.
Hybrid means I use a streaming parser to divide the input into
chunks/fragments. Next I loop over the chunks and load each one
completely into memory. This way I only have one chunk at a time in
memory plus a few globals, for example a counter to keep track of the
number of BLAST hits. This hybrid parsing works both for XML and for
legacy data formats like BLAST reports in tabular format.
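
For the tabular case the idea can be sketched as follows (Python, not
the Perl I actually use). It assumes the usual 12-column tab-separated
BLAST layout with the query id in column 1 and the bit score in column
12, and it assumes that all hits for one query sit on consecutive
lines. The function names are made up for the example.

```python
from itertools import groupby


def chunks_by_query(lines):
    """Stream tabular BLAST lines and yield (query_id, rows) per query.

    Only the rows of the current query are materialised as a list.
    """
    rows = (line.rstrip("\n").split("\t")
            for line in lines
            if line.strip() and not line.startswith("#"))
    for query_id, group in groupby(rows, key=lambda row: row[0]):
        yield query_id, list(group)


def best_hits(lines):
    """Return (total number of hits, {query_id: best subject id})."""
    total_hits = 0   # the kind of small global kept across chunks
    best = {}
    for query_id, hits in chunks_by_query(lines):
        total_hits += len(hits)
        # Inside a chunk we can jump around freely: it is all in memory.
        top = max(hits, key=lambda row: float(row[11]))
        best[query_id] = top[1]
    return total_hits, best
```

Passing an open file handle as `lines` keeps the memory footprint at
one query's worth of hits plus the small result dictionary.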
In one case I use an XSLT to parse and convert one type of Moby XML
message into an SVG image (which is also XML). The XSLT is pretty cool
and is also a kind of hybrid. Basically it works like a flat file
indexing system. You define certain nodes of interest. It parses the
XML once (SAX, not DOM) to make an index of where these nodes occur in
the XML. This index is kept in memory, but it is a lot smaller than
the document, since it stores only where a node occurs, not its
contents, and only for the nodes you specified, not for everything.
Next it parses the XML again (SAX, not DOM) to do your business logic.
The inconvenience of a pure streaming parser is that you cannot jump
around the data. Once the streaming parser has parsed and passed a
certain point it forgets about that data unless you decide to store it
in variables. So, if you parse 100 BLAST hits and when you parse number
88 you want to go back to compare it to number 12, you cannot unless
you stored number 12 somehow. With the XSLT you don't have to store
it, as you could have indexed the BLAST hits. So, when you need number
12 again the parser jumps back to that point and parses only the
chunk/fragment for number 12 again. This means some parts of the XML
might get parsed more than once, but for scalability that is much
better than parsing the entire XML several times or storing large
pieces in memory.
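
The index-then-jump idea does not depend on XSLT. Below is a rough
Python sketch of the same principle, not of how the XSLT processor
does it internally. It uses the expat parser from the standard library
to record byte offsets in a first pass, then re-parses a single
fragment on demand. Assumptions: the elements of interest are not
empty (<Hit/>), they do not rely on namespace prefixes declared
further up, and the file is UTF-8. The names build_index and fetch,
and the tag name "Hit" in the usage note, are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def build_index(stream, tag):
    """Pass 1: list (start, end) byte offsets for every <tag> element.

    `start` is the offset of the start tag, `end` the offset of the
    matching end tag. Only offsets are stored, never the contents.
    """
    index, open_starts = [], []
    parser = expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            open_starts.append(parser.CurrentByteIndex)

    def on_end(name):
        if name == tag:
            index.append((open_starts.pop(), parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.ParseFile(stream)          # stream must be opened in binary mode
    return index


def fetch(stream, index, n, tag):
    """Pass 2: jump to element number n and parse only that fragment."""
    start, end = index[n]
    stream.seek(start)
    # Read up to the end tag and close the element ourselves.
    fragment = stream.read(end - start) + f"</{tag}>".encode("utf-8")
    return ET.fromstring(fragment)
```

With index = build_index(handle, "Hit"), comparing hit 88 to hit 12
costs two small calls to fetch() instead of keeping all 100 hits in
memory.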
The disadvantage of XSLTs in my opinion is that they work radically
differently from any language I've seen before. Perl and Java are
different, but they share many concepts like if, else, while and for
loops. Developing XSLTs requires a completely different way of
thinking and I never found it easy.
> For references on the primitive type level, I would do a usual
> parsing of a Moby message and when I find a reference, I would
> resolve it to a local file and I would pass to my service class the
> data as a file reference. It would be up to the service to read the
> file contents into memory or not - that would depend on what it needs
> to do with the data.
>
> However, on the higher level, such as mobyData level, the reference
> must be treated differently. If I do the same (resolving the
> reference to a local file), I still need to call a parser again to
> parse the contents of the local file. Because I still want to give my
> service already parsed data (and not a biomoby XML), and because I do
> not want to have all in memory, I need to create local files for each
> primitive type and do the same as above (passing my service local
> file references). I do not see many other choices.
Not only would you have to do that for the primitives. If you want to
provide pre-parsed data to a service you would have to do it for the
values of namespace and id attributes at any level of mobyData too,
right? I assume we are talking about MoSeS here, where code is
generated, based on the data in BioMoby Central, both to parse the
BioMoby XML inputs of a service and to compile its results as BioMoby
XML. Saving everything pre-parsed as tmp files to disk doesn't make
much sense to me. If you did it both for the entire message and for
all its possible dissected parts it would cause quite some redundancy
and overhead.
But how to do it with a hybrid DOM/SAX parser or with an XSLT that
uses indices? You would have to generate the code to chunk the input
for a hybrid parser or you have to generate the code to index certain
nodes for the XSLT. This means you need to know how the business logic
of the service works to figure out at which level of the XML you would
have to chunk or index. I know Mark has this famous slide with the God
of BioMoby, but I'm afraid even the almighty Martin cannot predict
based on the info in BioMoby Central how the business logic of a
service works :(... In the current MoSeS, parsing of the input and the
business logic are two separate things, but if you want
to improve scalability with a hybrid SAX/DOM or XSLT parser, parsing
the input will become an essential part of the business logic. What
you could do is generate some disabled example code to chunk/index at
any level and then a developer could uncomment the lines required.
Furthermore there are some obvious levels at which chunking/indexing
makes sense. Take for example a huge Collection of Simples. It would
make sense to chunk/index at the level of the pseudo-root element of
the Simples. I don't think it's possible to pre-generate all code to
parse the input in the right order and with chunking/indexing at the
right levels, but with a few smart examples a developer should be able
to quickly change the order and/or adapt to chunking/indexing further
up or down the tree.
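
To make the Collection of Simples case concrete, here is a small
Python sketch of chunking at a configurable level (again not MoSeS
code and not Perl). The namespace URI and the element names Simple and
mobyData are what I assume a typical BioMoby message uses; the
function name iter_chunks is made up.

```python
import xml.etree.ElementTree as ET

MOBY_NS = "http://www.biomoby.org/moby"   # assumed BioMoby namespace URI


def iter_chunks(source, chunk_tag="Simple"):
    """Yield every <chunk_tag> element as a small in-memory tree.

    The stream is parsed incrementally; after the caller is done with
    a chunk its text and children are dropped again. An empty shell of
    the element stays attached to its parent, so memory use is small
    but not strictly constant.
    """
    wanted = {chunk_tag, "{%s}%s" % (MOBY_NS, chunk_tag)}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in wanted:
            yield elem     # use the chunk before asking for the next one
            elem.clear()
```

Generated example code could then expose the level as the one line a
developer has to change:

    CHUNK_TAG = "Simple"        # one Simple of a Collection at a time
    # CHUNK_TAG = "mobyData"    # uncomment to chunk further up the tree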
> Am I missing a point here?
No, making a typical "currency converter" example web service with a
single small input and a single small output is easy, but making a web
service that scales well is really hard!
There is a Perl module in the BioMoby CVS for XSLT based parsing. It
was developed in France some time ago if I remember correctly and
AFAIK it is not maintained, so it probably lacks some of the newer
stuff like proper error handling, async services etc., but it might be
a nice example of what can be done with XSLTs.
Ok, those of you who managed to read all the way up to here have
earned a beer at the next BioMoby developers meeting!
Cheers,
Pi
> The above is doable, of course. But you see why I wanted to have
> references only on the primitive type level. If we have them on any
> level (as it seems to be) I need, in my implementation, to actually
> make other (local) references for the primitive types, anyway. Of
> course, it is better than the remote references because I do not need
> to make a network connection for each individual primitive type
> (point taken, Pieter) - but it still has to be done.
>
> Cheers,
> Martin
>
> --
> Martin Senger
> email: martin.senger at gmail.com,m.senger at cgiar.org
> skype: martinsenger
-------------------------------------------------------------
Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034
Dreijenlaan 3
6703 HA Wageningen
The Netherlands
phone: +31 (0)317-483 060
mobile: +31 (0)6-143 66 783
e-mail: pieter.neerincx at gmail.com
skype: pieter.online
-------------------------------------------------------------