[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Wed Jul 19 22:40:52 UTC 2006

Hi Chris,

It seems to me the SearchIO framework isn't really appropriate for 
genomethreader, since it's more of a gene prediction program than a 
search/alignment program.

Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is 
fundamentally different from the other bioperl IO systems.  It still has 
a next_this(), next_that() interface, which means a lot of memory spent 
on buffering if you're doing your actual parsing with a push parser (or 
a tree parser, of course, which buffers an expanded form of the entire 
document).  It looks like it just adds another layer of method calls for 
parser events, which lets the SearchIO module build different kinds of 
objects.

It looks like none of this changes the fact that these are all push 
parsers, and bioperl pulls, so you have to buffer a lot of stuff.  I 
guess the only really general strategies for reducing the buffering are 
a.) to break up the XML with regexps and such, like Hilmar said, or b.) 
to put your push parser in another process and somehow keep it blocked 
in one of its callbacks until you're ready for its next data.
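
A rough sketch of strategy (b), written in Python rather than Perl purely 
to keep it short and self-contained.  It uses a thread instead of a 
separate process, which is a simplification.  The names 
_QueueingHandler and pull_elements are made up for illustration.  The 
SAX callback blocks on a one-slot queue until the pulling side asks for 
the next item, so at most one parsed item is waiting at a time:

```python
import queue
import threading
import xml.sax

_DONE = object()  # sentinel marking the end of the parse


class _QueueingHandler(xml.sax.ContentHandler):
    """Push-side handler: collects the text of one tag, hands it to a queue."""

    def __init__(self, tag, out_queue):
        super().__init__()
        self._tag = tag
        self._out = out_queue
        self._text = None  # None means "not inside the tag of interest"

    def startElement(self, name, attrs):
        if name == self._tag:
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._tag:
            # put() blocks while the queue is full, which stalls the push
            # parser inside its own callback until the consumer pulls.
            self._out.put("".join(self._text))
            self._text = None


def pull_elements(xml_bytes, tag):
    """Pull-style generator over the text of every <tag> element."""
    out = queue.Queue(maxsize=1)

    def run():
        try:
            xml.sax.parseString(xml_bytes, _QueueingHandler(tag, out))
        finally:
            out.put(_DONE)  # parse errors are not reported in this sketch

    # daemon=True: an abandoned, blocked parser thread won't keep us alive
    threading.Thread(target=run, daemon=True).start()
    while True:
        item = out.get()
        if item is _DONE:
            return
        yield item


ids = list(pull_elements(b"<r><id>a</id><x/><id>b</id></r>", "id"))
```

A real version would also need to pass parse errors across to the 
pulling side and shut the parser down cleanly if the caller stops early, 
which is where most of the ugliness of this approach lives.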

I think what I'll do with the gthxml parser is find a way to split the 
input XML into chunks and run a parser separately on each, like Hilmar 
said.  If more performance is needed, maybe a multi-process approach 
would be appropriate, but not yet.

Anyway, looking at blastxml, I have some ruminations, which fill the 
rest of this email:

Looking at SearchIO::blastxml, it looks like it's already using 
XML::SAX, which will use XML::SAX::ExpatXS if installed.  Is that 
recent?  Is blastxml faster when using the tempfile option than when 
putting the whole report in a string in memory?  If you're looking for 
speed gains, have you tried running some kind of profiling on it?  
Whenever one is out to optimize code, profiling should be stop number 
one.  Almost every time, you will be surprised at what parts of the code 
are actually eating up the most time.  Here's a perl profiling intro: 
http://perl.com/pub/a/2004/06/25/profiling.html .  The profiling 
mechanism discussed in that article is kind of old; there are also a 
bunch of newer code profiling tools available on CPAN, though I haven't 
used any of them.  But yeah, I can't emphasize enough the importance 
of profiling if you're trying to optimize for speed.

As for memory, the blastxml parser suffers from the same handicap I was 
pondering at the start of this thread.  To see what I mean, think of 
what would happen if there were somehow 10 million HSPs in one of the 
reports.  It buffers all of them before returning that result, and 
your machine could melt.  :-)  Things would be beautiful (and fast, 
probably) if next_hsp() would actually parse the next HSP in the report 
instead of just returning an HSP object that's sitting in memory.  But 
there's not really anything that can be done about that, I don't think.
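
For illustration only, here is roughly what a lazy next_hsp() could look 
like if an incremental, pull-style XML parser were available.  This is a 
Python sketch (not Perl, and not how blastxml works) built on the 
standard library's XMLPullParser.  The LazyHspReader class and the 
sample input are hypothetical, and the input is a heavily simplified 
stand-in for real BLAST XML:

```python
import io
import xml.etree.ElementTree as ET


class LazyHspReader:
    """Reads only as much input as needed to produce the next HSP."""

    def __init__(self, handle, chunk_size=64):
        self._handle = handle
        self._chunk_size = chunk_size
        self._parser = ET.XMLPullParser(events=("end",))
        self._pending = []  # HSPs parsed from the last chunk, not yet returned
        self._eof = False

    def next_hsp(self):
        # Feed the parser small chunks until at least one HSP is complete.
        while not self._pending and not self._eof:
            chunk = self._handle.read(self._chunk_size)
            if chunk:
                self._parser.feed(chunk)
            else:
                self._parser.close()
                self._eof = True
            for _event, elem in self._parser.read_events():
                if elem.tag == "Hsp":
                    self._pending.append({c.tag: c.text for c in elem})
                    # Drop the children; an empty <Hsp> shell stays in the tree.
                    elem.clear()
        return self._pending.pop(0) if self._pending else None


SAMPLE = (
    "<Hit_hsps>"
    "<Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>1e-30</Hsp_evalue></Hsp>"
    "<Hsp><Hsp_num>2</Hsp_num><Hsp_evalue>2e-5</Hsp_evalue></Hsp>"
    "</Hit_hsps>"
)

reader = LazyHspReader(io.StringIO(SAMPLE))
first = reader.next_hsp()
second = reader.next_hsp()
third = reader.next_hsp()  # None once the input is exhausted
```

The buffering here is bounded by the chunk size rather than by the 
number of HSPs in the report, apart from the empty element shells 
noted in the comment.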

One nice thing: the blastxml parser's memory footprint doesn't really 
suffer if you have 100,000 blast reports in your input file, because it 
splits out the reports and parses each one individually.  I think this 
is a good illustration of what Hilmar was talking about: breaking the 
input XML into chunks cuts down on the amount of buffering you have to do.
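
A minimal sketch of the chunking idea, in Python rather than Perl.  It 
assumes the input is a series of complete XML documents concatenated in 
one file, each starting with an <?xml ...?> declaration on its own line.  
That layout is an assumption here, not a description of the actual 
blastxml code, and split_reports and iter_query_defs are made-up names.  
Only one report is held in memory at a time:

```python
import io
import xml.etree.ElementTree as ET


def split_reports(handle):
    """Yield the text of one XML document at a time from a line iterator."""
    chunk = []
    for line in handle:
        # A new XML declaration marks the start of the next report.
        if line.startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


def iter_query_defs(handle):
    """Parse each report separately and pull out one field from it."""
    for text in split_reports(handle):
        root = ET.fromstring(text)  # tree for this one report only
        yield root.findtext("BlastOutput_query-def")


SAMPLE = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    "<BlastOutput_query-def>seqA</BlastOutput_query-def>\n"
    "</BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    "<BlastOutput_query-def>seqB</BlastOutput_query-def>\n"
    "</BlastOutput>\n"
)

query_defs = list(iter_query_defs(io.StringIO(SAMPLE)))
```

Peak memory then scales with the largest single report, not with the 
number of reports in the file.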

As XML parsers go, I kind of like XML::Twig, because it manages to 
combine most of the ease of use of a DOM/tree parser with the better 
memory usage and speed of a push parser (like SAX and XML::Parser).  Within a 
parser callback, you have a DOM-like tree that's just the part of your 
XML document you're interested in at that time, and then you free that 
structure when you're done picking things out of it.  I'm not sure how 
fast it is, though, probably not as fast as ExpatXS.  At any rate, it is 
definitely a lot more intuitive to use than a more standard push parser, 
since if you make good choices about what elements to use as the roots 
of your twigs, you can often do your processing on a self-contained 
chunk and not have to keep track of a bunch of parse state like you 
typically need with a straight push parser like XML::Parser or a SAX parser.
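
To make the twig idea concrete, here is a small imitation of it in 
Python.  This is not XML::Twig's own API, just the same pattern built on 
the standard library's iterparse.  The parse_with_twigs function, the 
handler table and the sample document are all hypothetical.  Each 
handler gets a complete, self-contained subtree, and the subtree is 
freed as soon as the handler returns:

```python
import io
import xml.etree.ElementTree as ET


def parse_with_twigs(source, twig_handlers):
    """Call twig_handlers[tag](elem) on each finished element, then free it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        handler = twig_handlers.get(elem.tag)
        if handler is not None:
            handler(elem)  # elem is a small DOM-like tree for this twig only
            elem.clear()   # release the twig's children and text


SAMPLE = (
    b"<Iteration_hits>"
    b"<Hit><Hit_id>hit1</Hit_id><Hit_hsps>"
    b"<Hsp><Hsp_num>1</Hsp_num></Hsp><Hsp><Hsp_num>2</Hsp_num></Hsp>"
    b"</Hit_hsps></Hit>"
    b"<Hit><Hit_id>hit2</Hit_id><Hit_hsps>"
    b"<Hsp><Hsp_num>1</Hsp_num></Hsp>"
    b"</Hit_hsps></Hit>"
    b"</Iteration_hits>"
)

summaries = []


def on_hit(hit):
    # No parse state to track: everything about this hit is in the twig.
    summaries.append((hit.findtext("Hit_id"), len(hit.findall("Hit_hsps/Hsp"))))


parse_with_twigs(io.BytesIO(SAMPLE), {"Hit": on_hit})
```

Choosing Hit as the twig root means the handler sees a hit together 
with all of its HSPs.  Choosing Hsp instead would keep less in memory 
but would require tracking which hit each HSP belongs to.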

Rob

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on).  However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.  
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains).  HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself.  So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything.  Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls).  Jason thinks it's b/c of object inheritance and instantiation; I
> think it's probably more complicated than that (there are a ton of method
> calls which tend to slow things down quite a bit as well).  
>
> I would say try using SearchIO, but instead of relying directly on object
> handler calls to create Hit/HSP objects using an object factory (which is
> where I think a majority of the speed is lost), build the data internally on
> the fly using start_element/end_element, then return hashes instead based on
> the element type triggered using end_element.  
>
> As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> hashes at some point, possibly starting off with a different SearchIO plugin
> module.  If you have other suggestions (XML parser of choice, ways to speed
> up parsing/retrieve data) we would be glad to hear them.
>
> Chris
>
>
>
>   
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Robert Buels
>> Sent: Tuesday, July 18, 2006 7:06 PM
>> To: bioperl-l at bioperl.org
>> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get
>> complicated
>>
>> Hi all,
>>
>> Here's a kind of abstract question about Bioperl and XML parsing:
>>
>> I'm thinking about writing a bioperl parser for genomethreader XML, and
>> I'm sort of mulling over the 'impedance mismatch' between the way
>> bioperl Bio::*IO::* modules work and the way all of the current XML
>> parsers work.  Bioperl uses a 'pull' model, where every time you want a
>> new chunk of stuff, you call $io_object->next_thing.  All the XML
>> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
>> 'push' model, where every time they parse a chunk, they call _your_
>> code, usually via a subroutine reference you've given to the XML parser
>> when you start it up.
>>
>>  From what I can tell, current Bioperl IO modules that parse XML are
>> using push parsers to parse the whole document, holding stuff in memory,
>> then spoon-feeding it in chunks to the calling program when it calls
>> next_*().  This is fine until the input XML gets really big, in which
>> case you can quickly run out of memory.
>>
>> Does anybody have good ideas for nice, robust ways of writing a bioperl
>> IO module for really big input XML files?  There don't seem to be any
>> perl pull parsers for XML.  All I've dug up so far would be having the
>> XML push parser running in a different thread or process, pushing chunks
>> of data into a pipe or similar structure that blocks the progress of the
>> push parser until the pulling bioperl code wants the next piece of data,
>> but there are plenty of ugly issues with that, whether one were to use
>> perl threads for it (aaagh!) or fork and push some kind of intermediate
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY  14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>
>   

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY  14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu