[MOBY-dev] Question on parser -> Big XML documents
Pieter Neerincx
Pieter.Neerincx at wur.nl
Tue Sep 6 16:30:46 UTC 2005
Hi,
I have some services that query databases. The result can be nothing,
a single object, or it can be several thousand objects.... I was also
running into trouble with big XML documents. I'm using the Perl API,
which uses SOAP::Lite, which uses XML::LibXML. SOAP::Lite gets the
job done for small XML structures, but for big ones it's a mess.
Firstly, SOAP::Lite loads the entire message in memory as one big
piece (hence no chunks or streams etc.). Secondly, if you use
Data::Dumper to have a look at the perl data structures that are
built, you will see that the same info is copied two, three or more
times. There's quite a bit of redundancy in there. As a result the
expansion factor for parsing XML with SOAP::Lite is between 10 and 13
(according to people on the SOAP::Lite mailing list). That means a 10
MB XML document will become 100-130 MB in memory. Several clients
accessing several of these services at the same time will simply
bring our servers to their knees :(. If there are people on the
mailing list with experience in handling laaaaaarge inputs and/or
outputs, I'd really appreciate it if you could drop me a few lines...
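
To put rough numbers on the arithmetic above, here is a minimal sketch in Python (chosen only for brevity; it is not part of the Perl API). The function name and the figure of 20 concurrent requests are made up for illustration; only the 10 MB document size and the 10-13 expansion factor come from the numbers quoted above.

```python
def estimated_memory_mb(doc_size_mb, expansion_factor, concurrent_requests=1):
    """Rough footprint when every request holds one fully parsed message."""
    return doc_size_mb * expansion_factor * concurrent_requests


# One 10 MB message at the low and high end of the reported expansion factor.
for factor in (10, 13):
    print(factor, estimated_memory_mb(10, factor), "MB")

# Hypothetical load: 20 requests in flight at once, each with a 10 MB message.
for factor in (10, 13):
    print(factor, estimated_memory_mb(10, factor, concurrent_requests=20), "MB")
```

The estimate ignores everything else the server process holds in memory, so it is a lower bound at best.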
So far I have looked at working with attachments, but that is not
really an option with Perl: combining SOAP::Lite with MIME::Tools is a
buggy combo. xsltproc sounds good. For now I have changed my services
to send only a pointer (URL) as the result, which the client then has
to fetch. As a quick and dirty workaround it works beautifully, but
from a design point of view it's bad bad bad :( ...
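
For what it's worth, the client side of the pointer workaround can read the fetched result incrementally instead of building the whole tree first. What follows is only a sketch under assumptions: it uses Python's standard `xml.etree.ElementTree.iterparse` rather than the Perl API, the `<results>`/`<item>` layout is invented and is not the MOBY message format, and it assumes the result objects are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(stream, tag):
    """Yield each <tag> element from an XML stream as soon as it is complete."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next element: detach what has
            # been handled so far so the tree never holds the whole result.
            root.clear()


# Stand-in for the fetched result. In practice `stream` could be the response
# object returned by urllib.request.urlopen() for the URL the service sent back.
sample = io.BytesIO(
    b"<results>"
    b"<item id='a'>first</item>"
    b"<item id='b'>second</item>"
    b"<item id='c'>third</item>"
    b"</results>"
)

ids = [elem.get("id") for elem in iter_elements(sample, "item")]
print(ids)
```

For a large result, memory use should then depend on the parser's read buffer and the size of a single object rather than on the size of the whole document, provided the caller does not keep references to elements it has already handled.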
Cheers,
Pi
On 31-Aug-2005, at 8:46 AM, Sebastien Carrere wrote:
> The MOBY message that I wanted to parse was a 12 Megabyte one.
> The web-service concerned is:
>
> name: ImgaGetTigrXMLEntriesFromKeyword
> uri: bioinfo.genopole-toulouse.prd.fr
> input: String
> Output(s): /Collection of /text-xml, as TIGRXML and /Collection of /
> IMGA_Accession, as IMGA_Accession
>
> I think this is a little bit extreme, but it works fine now.
>
> Sebastien
>
> Chunyan Wang wrote:
>
>
>> Hi,
>> I changed TimeOut from its default to 50000 in the Apache config to
>> fix the timeout problem.
>> How big was your XML file when you had the problem?
>> Cheers,
>>
>> Joyce
>>
>> Sebastien Carrere wrote:
>>
>>
>>> Hi all,
>>>
>>> I had the same problem when I wanted to parse huge XML files.
>>> That's why I have written a clone of CommonSubs.pm that uses
>>> "xsltproc" to parse MOBY messages.
>>> That removed the parsing time problem.
>>>
>>> However, how did you fix the timeout problem?
>>>
>>> Sebastien
>>>
>>> Chunyan Wang wrote:
>>>
>>>
>>>>
>>>>
>>>> Martin Senger wrote:
>>>>
>>>>
>>>>>> Could anybody explain this "problem" to me? Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> What language are you using, what XML library in that language?
>>>>>
>>>>>
>>>> I am using Perl and XML::DOM. I am using
>>>> "genericServiceInputParser($data)" to parse the input sequence
>>>> in my service.
>>>> By the way, I want to let you know I fixed the timeout problem.
>>>> Thanks for your suggestion.
>>>>
>>>> Joyce
>>>>
>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> MOBY-dev mailing list
>>>> MOBY-dev at biomoby.org
>>>> http://www.biomoby.org/mailman/listinfo/moby-dev
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> __________________________________________________________
>
> Sebastien CARRERE LIPM (INRA-CNRS)
> B.P.52627 -- 31326 CASTANET TOLOSAN
> tel:(33) 5-61-28-53-29
> fax:(33) 5-61-28-50-61
>
Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034
Dreijenlaan 3
6703 HA Wageningen
The Netherlands
phone: 0317-483 060
fax: 0317-483 584
mobile: 06-143 66 783
pieter.neerincx at wur.nl