[moby] Re: [MOBY-dev] Question on parser -> Big XML documents

Pieter Neerincx Pieter.Neerincx at wur.nl
Wed Sep 7 11:12:56 UTC 2005


On 6-Sep-2005, at 8:05 PM, Mark Wilkinson wrote:

> This is indeed still an issue, and you are right about the pain of using
> SOAP::Lite + MIME::Tools in Perl (though I know that is soon going to be
> better, since we are now using an apparently stable combination of
> these, plus SOAP attachments, for our own LSID resolver! However these
> are not available on CPAN yet AFAIK).

Ok, so there is a combination that really works :). Could you please
tell me which versions of SOAP::Lite and MIME::Tools you are combining to
make SOAP with attachments work?


>
> I recall that Lincoln S. wrote to Paul K. several years ago asking if it
> would ever be possible to swap out the DOM parser in SOAP::Lite for a
> SAX parser in order to overcome this limitation (and also with an eye to
> streaming responses...), but I don't think this even made it onto the
> SOAP::Lite radar so I doubt that the solution is going to come from that
> community anytime soon.

I doubt that as well. If I find a solution for streaming the SOAP
XML I'll post it to the list...
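
As an illustration of what streaming the XML could mean in practice, here
is a minimal sketch of incremental parsing. It is written in Python only
because its standard library ships an incremental parser; it is not
SOAP::Lite or MOBY code, and the element names (`results`, `item`), the
`id` attribute and the function `iter_items` are made up for the example.
The point is just that each object is handled as soon as its end tag has
been read, instead of building the whole document in memory first.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag):
    """Yield (id, text) for each element named `tag`, one at a time.

    `source` is a file name or a binary file-like object. The parser
    reads it incrementally, so the full tree is never needed at once.
    """
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.get("id"), (elem.text or "").strip()
            # Drop the element's text, attributes and children so that
            # memory use does not grow with the number of items. An empty
            # shell of the element stays attached to its parent.
            elem.clear()


if __name__ == "__main__":
    # Hypothetical result document standing in for a large service reply.
    doc = (
        b"<results>"
        b'<item id="a">first</item>'
        b'<item id="b">second</item>'
        b"</results>"
    )
    for item_id, text in iter_items(io.BytesIO(doc), "item"):
        print(item_id, text)
```

In Perl the same idea would need an event-based or chunked parser (for
example something built on XML::SAX or XML::Twig) in place of the
in-memory parse; whether that can be wired into SOAP::Lite is exactly the
open question here.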

Thanks,

Pieter

>
> So... I can't advise anything, but perhaps others in the MOBY community
> can!
>
> M
>
>
> On Tue, 2005-09-06 at 18:30 +0200, Pieter Neerincx wrote:
>
>> Hi,
>>
>> I have some services that query databases. The result can be nothing,
>> a single object, or it can be several thousand objects.... I was also
>> running into trouble with big XML documents. I'm using the Perl API,
>> which uses SOAP::Lite, which uses XML::LibXML. SOAP::Lite gets the
>> job done for small XML structures, but for big ones it's a mess.
>> Firstly, SOAP::Lite loads the entire message into memory as one big
>> piece (hence no chunks or streams etc.). Secondly, if you use
>> Data::Dumper to have a look at the Perl data structures that are
>> built, you will see that the same info is copied two, three or more
>> times. There's quite a bit of redundancy in there. As a result the
>> expansion factor for parsing XML with SOAP::Lite is between 10 and 13
>> (according to people on the SOAP::Lite mailing list). That means a 10
>> MB XML document will become 100-130 MB in memory. Several clients
>> accessing several of these services at the same time will simply
>> bring our servers to their knees :(. If there are people on the
>> mailing list with experience in handling laaaaaarge inputs and/or
>> outputs I'd really appreciate it if you drop a few lines...
>>
>> So far I have looked at working with attachments. Not really an
>> option with Perl. Combining SOAP::Lite with MIME::Tools is a buggy
>> combo. xsltproc sounds good. For now I have changed my services to
>> send only a pointer (URL) as the result, which the client has to fetch.
>> For a quick and dirty workaround it works beautifully, but from a
>> design point of view it is bad bad bad :( ...
>>
>> Cheers,
>>
>> Pi
>>
>>
>> On 31-Aug-2005, at 8:46 AM, Sebastien Carrere wrote:
>>
>>
>>> The MOBY message that I wanted to parse was a 12 Megabyte one.
>>> The web-service concerned is:
>>>
>>> name: ImgaGetTigrXMLEntriesFromKeyword
>>> uri: bioinfo.genopole-toulouse.prd.fr
>>> input: String
>>> Output(s): /Collection of /text-xml, as TIGRXML and /Collection of /
>>> IMGA_Accession, as IMGA_Accession
>>>
>>> I think this is a little bit extreme, but it works fine now.
>>>
>>> Sebastien
>>>
>>> Chunyan Wang wrote:
>>>
>>>
>>>
>>>> Hi,
>>>> I changed TimeOut from the default to 50000 in the Apache config to
>>>> fix the timeout problem.
>>>> How big was your XML file when you had the problem?
>>>> Cheers,
>>>>
>>>> Joyce
>>>>
>>>> Sebastien Carrere wrote:
>>>>
>>>>
>>>>
>>>>> Hi all,
>>>>>
>>>>> I had the same problem when I wanted to parse huge XML files.
>>>>> That's why I have written a clone of CommonSub.pm that uses
>>>>> "xsltproc" to parse MOBY messages.
>>>>> That removed the parsing time problem.
>>>>>
>>>>> However, how did you fix the timeout problem?
>>>>>
>>>>> Sebastien
>>>>>
>>>>> Chunyan Wang wrote:
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Martin Senger wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Could anybody explain this "problem" to me? Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   What language are you using, what XML library in that language?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I am using Perl and XML::DOM. I am using
>>>>>> "genericServiceInputParser($data)" to parse the input sequence
>>>>>> in my service.
>>>>>> By the way, I want to let you know I fixed the timeout problem.
>>>>>> Thanks for your suggestion.
>>>>>>
>>>>>> Joyce
>>>>>>
>>>>>>
>>>>>>
>>>>>>>   Martin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> MOBY-dev mailing list
>>>>>> MOBY-dev at biomoby.org
>>>>>> http://www.biomoby.org/mailman/listinfo/moby-dev
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> -- 
>>> __________________________________________________________
>>>
>>> Sebastien CARRERE                        LIPM (INRA-CNRS)
>>>                      B.P.52627 -- 31326 CASTANET TOLOSAN
>>> tel:(33) 5-61-28-53-29
>>> fax:(33) 5-61-28-50-61
>>>
>>>
>>>
>>>
>>
>>
>> Wageningen University and Research centre (WUR)
>> Laboratory of Bioinformatics
>> Transitorium (building 312) room 1034
>> Dreijenlaan 3
>> 6703 HA Wageningen
>> The Netherlands
>> phone: 0317-483 060
>> fax: 0317-483 584
>> mobile: 06-143 66 783
>> pieter.neerincx at wur.nl
>>
>>
>>
>>
> -- 
> "Ontologists do it with the edges!"
>
> Mark Wilkinson
> Asst. Professor
> Dept. of Medical Genetics
> University of British Columbia
> PI in Bioinformatics
> iCAPTURE Centre
> St. Paul's Hospital
> Rm. 166, 1081 Burrard St.
> Vancouver, BC, V6Z 1Y6
> tel: 604 682 2344 x62129
> fax: 604 806 9274
>
>


Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034
Dreijenlaan 3
6703 HA Wageningen
The Netherlands
phone: 0317-483 060
fax: 0317-483 584
mobile: 06-143 66 783
pieter.neerincx at wur.nl





