[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Robert Buels rmb32 at cornell.edu
Wed Jul 19 00:06:02 UTC 2006


Hi all,

Here's a kind of abstract question about Bioperl and XML parsing:

I'm thinking about writing a bioperl parser for genomethreader XML, and 
I'm sort of mulling over the 'impedance mismatch' between the way 
bioperl Bio::*IO::* modules work and the way all of the current XML 
parsers work.  Bioperl uses a 'pull' model, where every time you want a 
new chunk of stuff, you call $io_object->next_thing.  All the XML 
parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 
'push' model, where every time they parse a chunk, they call _your_ 
code, usually via a subroutine reference you've given to the XML parser 
when you start it up.

From what I can tell, current Bioperl IO modules that parse XML are 
using push parsers to parse the whole document, holding stuff in memory, 
then spoon-feeding it in chunks to the calling program when it calls 
next_*().  This is fine until the input XML gets really big, in which 
case you can quickly run out of memory.
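
To make the buffering approach concrete, here is a minimal sketch of a push (SAX) parser wrapped in a pull-style interface. It is written in Python with only the standard library, purely for brevity; it is not Bioperl code. The `<hit>` elements, the `BufferedHitReader` class and its `next_hit()` method are hypothetical names invented for illustration, not the genomethreader format or any existing Bio::*IO API.

```python
import io
import xml.sax


class _Collector(xml.sax.ContentHandler):
    """Push side: the SAX parser calls startElement() for every element."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def startElement(self, name, attrs):
        # Hypothetical element name; a real handler would build richer objects.
        if name == "hit":
            self.hits.append(attrs["id"])


class BufferedHitReader:
    """Pull side: next_hit() hands out results one at a time."""

    def __init__(self, stream):
        collector = _Collector()
        # The whole document is parsed here, before the first next_hit()
        # call, so memory use grows with the size of the input.
        xml.sax.parse(stream, collector)
        self._hits = iter(collector.hits)

    def next_hit(self):
        # Returns None once everything has been handed out.
        return next(self._hits, None)


reader = BufferedHitReader(io.BytesIO(b"<hits><hit id='a'/><hit id='b'/></hits>"))
ids = []
hit = reader.next_hit()
while hit is not None:
    ids.append(hit)
    hit = reader.next_hit()
```

The caller sees a clean pull loop, but every parsed object already exists in `collector.hits` before the loop starts, which is exactly what hurts on very large inputs.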

Does anybody have good ideas for nice, robust ways of writing a bioperl 
IO module for really big input XML files?  There don't seem to be any 
perl pull parsers for XML.  All I've dug up so far would be having the 
XML push parser running in a different thread or process, pushing chunks 
of data into a pipe or similar structure that blocks the progress of the 
push parser until the pulling bioperl code wants the next piece of data, 
but there are plenty of ugly issues with that, whether one were to use 
perl threads for it (aaagh!) or fork and push some kind of intermediate 
format through a pipe or socket between the two processes (eek!).
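
The thread variant of that idea can be sketched roughly as follows, again in stdlib Python rather than Perl and under the same assumptions (hypothetical `<hit>` elements, hypothetical `ThreadedHitReader` and `next_hit()` names). A bounded queue stands in for the pipe: the push parser blocks whenever the queue is full, so only `max_pending` items sit in memory at a time.

```python
import io
import queue
import threading
import xml.sax

_DONE = object()  # sentinel placed on the queue when parsing has finished


class _QueueingHandler(xml.sax.ContentHandler):
    """Push side: forwards each parsed item to a bounded queue."""

    def __init__(self, q):
        super().__init__()
        self._q = q

    def startElement(self, name, attrs):
        if name == "hit":
            # put() blocks while the queue is full, which stalls the parser
            # until the consumer asks for the next item.
            self._q.put(attrs["id"])


class ThreadedHitReader:
    """Pull side: next_hit() takes one item from the queue per call."""

    def __init__(self, stream, max_pending=1):
        self._q = queue.Queue(maxsize=max_pending)
        # Daemon thread, so an abandoned reader does not keep the process alive.
        self._thread = threading.Thread(
            target=self._run, args=(stream,), daemon=True
        )
        self._thread.start()

    def _run(self, stream):
        try:
            xml.sax.parse(stream, _QueueingHandler(self._q))
        finally:
            # Always signal end of input, even if the parser raised.
            self._q.put(_DONE)

    def next_hit(self):
        item = self._q.get()
        if item is _DONE:
            self._q.put(_DONE)  # keep later calls returning None
            return None
        return item


reader = ThreadedHitReader(io.BytesIO(b"<hits><hit id='a'/><hit id='b'/></hits>"))
ids = []
hit = reader.next_hit()
while hit is not None:
    ids.append(hit)
    hit = reader.next_hit()
```

Even this small sketch shows some of the ugliness: a parse error is raised in the worker thread rather than in the caller, and a reader that is dropped halfway leaves the parser thread blocked on `put()` until the process exits.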

So, um, if you've read this far, do you have any ideas?

Rob

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY  14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu




