[moby] Re: [MOBY-dev] Question on parser -> Big XML documents

Mark Wilkinson markw at illuminae.com
Tue Sep 6 18:05:19 UTC 2005


This is indeed still an issue, and you are right about the pain of using
SOAP::Lite + MIME::Tools in Perl (though I know that is soon going to be
better, since we are now using an apparently stable combination of
these, plus SOAP attachments, for our own LSID resolver! However, these
are not available on CPAN yet, AFAIK).

I recall that Lincoln S. wrote to Paul K. several years ago asking if it
would ever be possible to swap out the DOM parser in SOAP::Lite for a
SAX parser in order to overcome this limitation (and also with an eye to
streaming responses...), but I don't think this even made it onto the
SOAP::Lite radar, so I doubt that the solution is going to come from that
community anytime soon.
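
For anyone unfamiliar with the distinction, here is a minimal sketch of
event-driven (streaming) parsing as opposed to building a full DOM. It is
in Python rather than Perl, purely as an illustration, and says nothing
about SOAP::Lite internals. It uses ElementTree's iterparse, which is a
pull-style incremental parser rather than a true SAX callback interface.
The function name count_streamed and the element names are made up for
the example.

```python
import io
import xml.etree.ElementTree as ET


def count_streamed(source, tag):
    """Count <tag> elements in `source` without keeping their content."""
    count = 0
    # iterparse yields each element as soon as its end tag has been read,
    # so the document is consumed incrementally.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Release the element's text and children once we are done with it.
            elem.clear()
    return count


# Hypothetical stand-in for a large service response; not a real MOBY message.
doc = b"<Collection>" + b"<Simple>x</Simple>" * 1000 + b"</Collection>"
print(count_streamed(io.BytesIO(doc), "Simple"))
```

The point is only that each element is handled and discarded as it
arrives, instead of the whole message being expanded in memory first.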

So... I can't advise anything, but perhaps others in the MOBY community
can!

M


On Tue, 2005-09-06 at 18:30 +0200, Pieter Neerincx wrote:
> Hi,
> 
> I have some services that query databases. The result can be nothing,
> a single object, or it can be several thousand objects... I was also
> running into trouble with big XML documents. I'm using the Perl API,
> which uses SOAP::Lite, which uses XML::LibXML. SOAP::Lite gets the
> job done for small XML structures, but for big ones it's a mess.
> Firstly, SOAP::Lite loads the entire message into memory as one big
> piece (hence no chunks or streams etc.). Secondly, if you use
> Data::Dumper to have a look at the Perl data structures that are
> built, you will see that the same info is copied two, three or more
> times. There's quite a bit of redundancy in there. As a result the
> expansion factor for parsing XML with SOAP::Lite is between 10 and 13
> (according to people on the SOAP::Lite mailing list). That means a 10
> MB XML document will become 100-130 MB in memory. Several clients
> accessing several of these services at the same time will simply
> bring our servers to their knees :(. If there are people on the
> mailing list with experience in handling laaaaaarge inputs and/or
> outputs I'd really appreciate it if you would drop a few lines...
> 
> So far I have looked at working with attachments. That is not really an
> option with Perl: combining SOAP::Lite with MIME::Tools is a buggy
> combo. xsltproc sounds good. For now I have changed my services to send
> only a pointer (URL) as the result, which the client has to fetch. As a
> quick and dirty workaround it works beautifully, but from a design
> point of view it is bad bad bad :( ...
> 
> Cheers,
> 
> Pi
> 
> 
> On 31-Aug-2005, at 8:46 AM, Sebastien Carrere wrote:
> 
> > The MOBY message that I wanted to parse was a 12 Megabyte one.
> > The web-service concerned is:
> >
> > name: ImgaGetTigrXMLEntriesFromKeyword
> > uri: bioinfo.genopole-toulouse.prd.fr
> > input: String
> > Output(s): /Collection of /text-xml, as TIGRXML and /Collection of / 
> > IMGA_Accession, as IMGA_Accession
> >
> > I think this is a little bit extreme, but it works fine now.
> >
> > Sebastien
> >
> > Chunyan Wang wrote:
> >
> >
> >> Hi,
> >> I changed TimeOut from the default to 50000 in the Apache config to
> >> fix the timeout problem.
> >> How big was your XML file when you had the problem?
> >> Cheers,
> >>
> >> Joyce
> >>
> >> Sebastien Carrere wrote:
> >>
> >>
> >>> Hi all,
> >>>
> >>> I got the same problem when I wanted to parse huge XML files.
> >>> That's why I have written a clone of CommonSub.pm using
> >>> "xsltproc" to parse MOBY messages.
> >>> With that, the parsing time problem went away.
> >>>
> >>> However, how did you fix the timeout problem?
> >>>
> >>> Sebastien
> >>>
> >>> Chunyan Wang wrote:
> >>>
> >>>
> >>>>
> >>>>
> >>>> Martin Senger wrote:
> >>>>
> >>>>
> >>>>>> Could anybody explain this "problem" to me? Thanks.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>   What language are you using, what XML library in that language?
> >>>>>
> >>>>>
> >>>> I am using Perl and XML::DOM. I am using
> >>>> "genericServiceInputParser($data)" to parse the input sequence
> >>>> in my service.
> >>>> By the way, I want to let you know I fixed the timeout problem.
> >>>> Thanks for your suggestion.
> >>>>
> >>>> Joyce
> >>>>
> >>>>
> >>>>>   Martin
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> _______________________________________________
> >>>> MOBY-dev mailing list
> >>>> MOBY-dev at biomoby.org
> >>>> http://www.biomoby.org/mailman/listinfo/moby-dev
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> > -- 
> > __________________________________________________________
> >
> > Sebastien CARRERE                        LIPM (INRA-CNRS)
> >                      B.P.52627 -- 31326 CASTANET TOLOSAN
> > tel:(33) 5-61-28-53-29
> > fax:(33) 5-61-28-50-61
> >
> >
> 
> 
> Wageningen University and Research centre (WUR)
> Laboratory of Bioinformatics
> Transitorium (building 312) room 1034
> Dreijenlaan 3
> 6703 HA Wageningen
> The Netherlands
> phone: 0317-483 060
> fax: 0317-483 584
> mobile: 06-143 66 783
> pieter.neerincx at wur.nl
> 
> 
> 
-- 
"Ontologists do it with the edges!"

Mark Wilkinson
Asst. Professor
Dept. of Medical Genetics
University of British Columbia
PI in Bioinformatics
iCAPTURE Centre
St. Paul's Hospital
Rm. 166, 1081 Burrard St.
Vancouver, BC, V6Z 1Y6
tel: 604 682 2344 x62129
fax: 604 806 9274



