[MOBY-dev] Question on parser -> Big XML documents

Pieter Neerincx Pieter.Neerincx at wur.nl
Tue Sep 6 16:30:46 UTC 2005


Hi,

I have some services that query databases. The result can be nothing,
a single object, or several thousand objects, and I was also running
into trouble with big XML documents. I'm using the Perl API, which
uses SOAP::Lite, which uses XML::LibXML. SOAP::Lite gets the job done
for small XML structures, but for big ones it's a mess. Firstly,
SOAP::Lite loads the entire message into memory as one big piece
(no chunks, no streams). Secondly, if you use Data::Dumper to look at
the Perl data structures that are built, you will see that the same
information is copied two, three or more times. As a result the
expansion factor for parsing XML with SOAP::Lite is between 10 and 13
(according to people on the SOAP::Lite mailing list), so a 10 MB XML
document becomes 100-130 MB in memory. Several clients accessing
several of these services at the same time will simply bring our
servers to their knees :(. If there are people on the mailing list
with experience in handling laaaaaarge inputs and/or outputs, I'd
really appreciate it if you dropped me a few lines...
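
To make the contrast with "load everything as one big piece" concrete, here is a minimal sketch of incremental parsing. It is written in Python with the standard-library xml.etree.ElementTree.iterparse, not with SOAP::Lite or the MOBY Perl API, and the element name "record" and the helper names are made up for the illustration. The idea is only that each element is handled and then discarded, so memory use stays roughly flat instead of growing with the document.

```python
import io
import xml.etree.ElementTree as ET


def count_records(stream, tag="record"):
    """Count <tag> elements without keeping the whole tree in memory.

    'record' is a hypothetical element name used only for this sketch.
    """
    count = 0
    # iterparse reads the stream incrementally and reports start/end events.
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop the children parsed so far so they can be garbage collected.
            root.clear()
    return count


def make_document(n):
    """Build a small test document with n <record> elements (illustrative)."""
    parts = ["<records>"]
    parts.extend("<record id='%d'>payload</record>" % i for i in range(n))
    parts.append("</records>")
    return "".join(parts).encode("utf-8")


if __name__ == "__main__":
    print(count_records(io.BytesIO(make_document(1000))))
```

Whether something equivalent can be wired into a SOAP::Lite based service is a separate question; the sketch only shows the parsing pattern itself.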

So far I have looked at working with attachments, which is not really
an option with Perl: combining SOAP::Lite with MIME::Tools is a buggy
combo. xsltproc sounds good. For now I have changed my services to
send only a pointer (URL) as the result, which the client then has to
fetch. As a quick and dirty workaround it works beautifully, but from
a design point of view it is bad, bad, bad :( ...
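
The pointer workaround can be sketched roughly as follows. This is an illustration in Python rather than the actual Perl services, and every name in it (publish_result, fetch_result, the output directory) is hypothetical. The service stores the large result and returns only its URL; the client then copies it to disk in fixed-size chunks instead of reading it into memory in one go. A file:// URL is used so the sketch runs without a web server; a real service would presumably hand out an http:// URL.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def publish_result(payload: bytes, out_dir: Path) -> str:
    """Service side: store the big result and return only a pointer to it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(dir=out_dir, suffix=".xml", delete=False) as fh:
        fh.write(payload)
    # Assumption: a file:// URL stands in for the http:// URL a real
    # service would return, to keep the sketch runnable offline.
    return Path(fh.name).resolve().as_uri()


def fetch_result(url: str, dest: Path, chunk_size: int = 64 * 1024) -> int:
    """Client side: copy the result to disk chunk by chunk."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest.stat().st_size


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    url = publish_result(b"<record/>" * 1000, work / "results")
    print(url, fetch_result(url, work / "download.xml"))
```

The SOAP message itself stays tiny this way, at the price of a second request and of result files that have to be cleaned up on the server.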

Cheers,

Pi


On 31-Aug-2005, at 8:46 AM, Sebastien Carrere wrote:

> The MOBY message that I wanted to parse was a 12 Megabyte one.
> The web-service concerned is:
>
> name: ImgaGetTigrXMLEntriesFromKeyword
> uri: bioinfo.genopole-toulouse.prd.fr
> input: String
> Output(s): Collection of text-xml, as TIGRXML, and Collection of
> IMGA_Accession, as IMGA_Accession
>
> I think this is a little bit extreme, but it works fine now.
>
> Sebastien
>
> Chunyan Wang wrote:
>
>
>> Hi,
>> I changed TimeOut from the default to 50000 in the Apache config to
>> fix the timeout problem.
>> How big was your XML file when you had the problem?
>> Cheers,
>>
>> Joyce
>>
>> Sebastien Carrere wrote:
>>
>>
>>> Hi all,
>>>
>>> I had the same problem when I wanted to parse huge XML files.
>>> That's why I wrote a clone of CommonSub.pm that uses
>>> "xsltproc" to parse the MOBY message.
>>> That removed the parsing time problem.
>>>
>>> However, how did you fix the timeout problem?
>>>
>>> Sebastien
>>>
>>> Chunyan Wang wrote:
>>>
>>>
>>>>
>>>>
>>>> Martin Senger wrote:
>>>>
>>>>
>>>>>> Could anybody explain this "problem" to me? Thanks.
>>>>>
>>>>>   What language are you using, what XML library in that language?
>>>>>
>>>> I am using Perl and XML::DOM. I am using
>>>> "genericServiceInputParser($data)" to parse the input sequence
>>>> in my service.
>>>> By the way, I wanted to let you know that I fixed the timeout problem.
>>>> Thanks for your suggestion.
>>>>
>>>> Joyce
>>>>
>>>>
>>>>>   Martin
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> MOBY-dev mailing list
>>>> MOBY-dev at biomoby.org
>>>> http://www.biomoby.org/mailman/listinfo/moby-dev
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> -- 
> __________________________________________________________
>
> Sebastien CARRERE                        LIPM (INRA-CNRS)
>                      B.P.52627 -- 31326 CASTANET TOLOSAN
> tel:(33) 5-61-28-53-29
> fax:(33) 5-61-28-50-61
>
>


Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034
Dreijenlaan 3
6703 HA Wageningen
The Netherlands
phone: 0317-483 060
fax: 0317-483 584
mobile: 06-143 66 783
pieter.neerincx at wur.nl





