[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Robert Buels rmb32 at cornell.edu
Wed Jul 19 19:30:28 UTC 2006


POE is a really neat thing; I didn't know about it before.  Something 
tells me, however, that I would have trouble convincing people to 
install POE as a dependency for a genomethreader output parser.  ;-)  I 
hope I'll have the opportunity to use it sometime.

For the curious, here's a nice intro to POE:
http://perl.com/pub/a/2001/01/poe.html
And the POE main site:
http://poe.perl.org/
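
If you want a quick taste of the style without reading the articles, here's 
a toy POE session (this assumes POE is installed from CPAN; the 'tick' 
event name is invented for the demo, and this is only a sketch of the 
event-loop flavor, not anything parser-related):

```perl
#!/usr/bin/perl
# Toy POE session: instead of calling subroutines directly, you queue
# named events with yield(), and POE's kernel dispatches them from its
# event loop.
use strict;
use warnings;
use POE;

my @ticks;

POE::Session->create(
    inline_states => {
        _start => sub {
            # queue the first 'tick' event (name made up for this demo)
            $_[KERNEL]->yield( tick => 0 );
        },
        tick => sub {
            my ( $kernel, $n ) = @_[ KERNEL, ARG0 ];
            push @ticks, $n;
            # keep re-queueing ourselves until we've ticked three times
            $kernel->yield( tick => $n + 1 ) if $n < 2;
        },
    },
);

POE::Kernel->run();    # returns once the event queue is empty
print "saw ticks: @ticks\n";
```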

Rob

aaron.j.mackey at GSK.COM wrote:
> There are third-generation XML "pull" parsers (also called "StAX", for 
> Streaming API for XML), but they still seem to be stuck in Java land (e.g. 
> "MXP1").
>
> You could probably use POE to set up a state machine that used XML::Twig to 
> "push" units of XML content onto a stack, to be read by your "next_*" pull 
> method (where the XML::Twig push "stalled" until the "next_*" method was 
> called, and vice versa).
>
> -Aaron
>
> bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM:
>
>> Hi all,
>>
>> Here's a kind of abstract question about Bioperl and XML parsing:
>>
>> I'm thinking about writing a bioperl parser for genomethreader XML, and 
>> I'm sort of mulling over the 'impedance mismatch' between the way 
>> bioperl Bio::*IO::* modules work and the way all of the current XML 
>> parsers work.  Bioperl uses a 'pull' model, where every time you want a 
>> new chunk of stuff, you call $io_object->next_thing.  All the XML 
>> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 
>> 'push' model, where every time they parse a chunk, they call _your_ 
>> code, usually via a subroutine reference you've given to the XML parser 
>> when you start it up.
>>
>> From what I can tell, current Bioperl IO modules that parse XML are 
>> using push parsers to parse the whole document, holding stuff in memory, 
>> then spoon-feeding it in chunks to the calling program when it calls 
>> next_*().  This is fine until the input XML gets really big, in which 
>> case you can quickly run out of memory.
>>
>> Does anybody have good ideas for nice, robust ways of writing a bioperl 
>> IO module for really big input XML files?  There don't seem to be any 
>> perl pull parsers for XML.  The best I've come up with so far would be 
>> having the XML push parser run in a different thread or process, pushing chunks 
>> of data into a pipe or similar structure that blocks the progress of the 
>> push parser until the pulling bioperl code wants the next piece of data, 
>> but there are plenty of ugly issues with that, whether one were to use 
>> perl threads for it (aaagh!) or fork and push some kind of intermediate 
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
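
To make the fork-and-pipe idea above a bit more concrete, here's a rough 
toy sketch of the kind of thing I mean.  Everything here is illustrative: 
make_puller and the <record> format are made up, and a trivial regex loop 
stands in for a real push parser like XML::Twig writing records to the 
pipe from its handlers.  The point is that the pipe's buffer blocks the 
child's "push" side until the parent's next_* call drains it:

```perl
#!/usr/bin/perl
# Hypothetical pull-over-push bridge: run the "push" side in a forked
# child, have it write one record per line into a pipe, and hand the
# caller a next_record-style closure that reads the pipe on demand.
use strict;
use warnings;

sub make_puller {
    my ($xml) = @_;
    pipe( my $reader, my $writer ) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ( $pid == 0 ) {    # child: the "push" side
        close $reader;
        # stand-in for an XML::Twig handler firing once per element;
        # print() blocks here once the pipe's buffer fills up, which
        # throttles the push parser to the pace of the pulling caller
        for my $chunk ( $xml =~ /<record>(.*?)<\/record>/gs ) {
            print {$writer} "$chunk\n";
        }
        close $writer;
        exit 0;
    }
    close $writer;        # parent: the "pull" side
    return sub {          # the next_record() closure
        my $line = <$reader>;
        unless ( defined $line ) {
            waitpid $pid, 0;    # reap the child at end of stream
            return;
        }
        chomp $line;
        return $line;
    };
}

my $next_record = make_puller('<r><record>a</record><record>b</record></r>');
while ( defined( my $rec = $next_record->() ) ) {
    print "record: $rec\n";    # prints "record: a" then "record: b"
}
```

Obviously a real version would need a less fragile intermediate format 
than one-line-per-record, plus error handling, but it shows the shape of 
the thing.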

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY  14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
