[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Robert Buels
rmb32 at cornell.edu
Wed Jul 19 22:40:52 UTC 2006
Hi Chris,
It seems to me the SearchIO framework isn't really appropriate for
genomethreader, since it's more of a gene prediction program than a
search/alignment program.
Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is
fundamentally different from the other bioperl IO systems. It still has
a next_this(), next_that() interface, which means lots of buffering
memory if you're doing your actual parsing with a push parser (or a tree
parser, of course, which is buffering an expanded form of the entire
document). It looks like it just adds another layer of method calls for
parser events, allowing the SearchIO to make different kinds of objects
and stuff.
It looks like none of this changes the fact that these are all push
parsers, and bioperl pulls, so you have to buffer a lot of stuff. I
guess the only really general strategies for reducing the buffering are
a.) to break up the XML with regexps and such like Hilmar said, or b.) to
put your push parser in another process, and somehow keep it blocking in
one of its callbacks until you're ready for its next data.
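To make option (b) concrete, here is a minimal sketch using only core Perl. The push_parse() routine is a made-up stand-in for a real push parser, and the <record> element and the one-record-per-line wire format are assumptions for illustration. Note that the child only blocks once the pipe buffer fills, so this bounds the buffering rather than eliminating it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real push parser: it calls $on_record once
# per <record>...</record>. A real module would drive XML::Parser or a
# SAX handler here instead.
sub push_parse {
    my ($xml, $on_record) = @_;
    while ($xml =~ m{<record>(.*?)</record>}gs) {
        $on_record->($1);
    }
}

# Run the push parser in a child process and return a closure that pulls
# one record per call, or undef once the child has finished.
sub make_puller {
    my ($xml) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                      # child: the "push" side
        close $reader;
        push_parse($xml, sub {
            my ($text) = @_;
            $text =~ s/\s+/ /g;           # one record per line on the wire
            print {$writer} "$text\n";    # blocks once the pipe buffer is full
        });
        close $writer;
        exit 0;
    }

    close $writer;                        # parent: the "pull" side
    my $done = 0;
    return sub {
        return if $done;
        my $line = <$reader>;
        if (!defined $line) {             # child closed its end: reap it
            $done = 1;
            close $reader;
            waitpid($pid, 0);
            return;
        }
        chomp $line;
        return $line;
    };
}

my $next_record = make_puller('<doc><record>a</record><record>b</record></doc>');
my @got;
while (defined(my $rec = $next_record->())) {
    push @got, $rec;
}
print join(',', @got), "\n";
```

A real version would need a proper intermediate format and error handling for a child that dies mid-parse, which is where the ugliness comes in.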
I think what I'll do with the gthxml parser is find a way to split the
input XML into chunks and run a parser separately on each, like Hilmar
said. If more performance is needed, maybe a multi-process approach
would be appropriate, but not yet.
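A rough sketch of that chunking idea, in core Perl only. The <prediction> element and the sample document are invented for illustration and are not the real genomethreader XML layout. It also assumes the record element is never nested inside itself and that its closing tag never shows up inside a comment or CDATA section.

```perl
use strict;
use warnings;

# Return a closure that yields one <$tag>...</$tag> chunk per call, so
# only a single record is held in memory at a time.
sub chunk_iterator {
    my ($fh, $tag) = @_;
    return sub {
        local $/ = "</$tag>";             # read up to the next closing tag
        while (defined(my $buf = <$fh>)) {
            # drop anything before the opening tag (prolog, wrapper element)
            return $1 if $buf =~ m{(<\Q$tag\E[\s>].*)\z}s;
        }
        return;                           # no more chunks
    };
}

# Hypothetical input; the element names are assumptions.
my $xml = <<'XML';
<?xml version="1.0"?>
<results>
  <prediction id="1"><score>0.9</score></prediction>
  <prediction id="2"><score>0.4</score></prediction>
</results>
XML

open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
my $next_chunk = chunk_iterator($fh, 'prediction');

my @ids;
while (defined(my $chunk = $next_chunk->())) {
    # a real module would hand $chunk to an XML parser at this point
    push @ids, $1 if $chunk =~ /id="(\d+)"/;
}
print "@ids\n";
```

Each call to the closure maps naturally onto a next_thing() method, so the pull interface stays intact while the parser only ever sees one small document.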
Anyway, looking at blastxml, I have some ruminations, which fill the
rest of this email:
Looking at SearchIO::blastxml, it looks like it's already using
XML::SAX, which will use XML::SAX::ExpatXS if installed. Is that
recent? Is blastxml faster when using the tempfile option than when
putting the whole report in a string in memory? If you're looking for
speed gains, have you tried running some kind of profiling on it?
Whenever one is out to optimize code, profiling should be stop number
one. Almost every time, you will be surprised at what parts of the code
are actually eating up the most time. Here's a perl profiling intro:
http://perl.com/pub/a/2004/06/25/profiling.html . The profiling
mechanism talked about in that article is kind of old; there are also a
bunch of newer code profiling tools available on CPAN. I haven't used
any of them though. But yeah, I can't emphasize enough the importance
of profiling if you're trying to optimize for speed.
As for memory, the blastxml parser suffers from the same handicap I was
pondering at the start of this thread. To see what I mean, think of
what would happen if there were somehow 10 million HSPs in one of the
reports? It's buffering all of them before returning each result, and
your machine could melt. :-) Things would be beautiful (and fast,
probably) if next_hsp() would actually parse the next HSP in the report
instead of just returning an HSP object that's sitting in memory. But
there's not really anything that can be done about that, I don't think.
One nice thing, the blastxml parser's memory footprint doesn't really
suffer if you have 100,000 blast reports in your input file, because it
splits out the reports and parses each one individually. This I think
is a good illustration of what Hilmar was talking about: breaking the
input XML into chunks cuts down on the amount of buffering you have to do.
As XML parsers go, I kind of like XML::Twig, because it manages to
combine most of the ease of use of a DOM/tree parser with the better memory
usage and speed of a push parser (like SAX and XML::Parser). Within a
parser callback, you have a DOM-like tree that's just the part of your
XML document you're interested in at that time, and then you free that
structure when you're done picking things out of it. I'm not sure how
fast it is, though, probably not as fast as ExpatXS. At any rate, it is
definitely a lot more intuitive to use than a more standard push parser,
since if you make good choices about what elements to use as the roots
of your twigs, you can often do your processing on a self-contained
chunk and not have to keep track of a bunch of parse state like you
typically need with a straight push parser like XML::Parser or a SAX parser.
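For example, here is a minimal XML::Twig sketch, assuming the module is installed. The input is a trimmed-down fragment that borrows element names from BLAST XML output rather than a complete report, and the hash layout is just something I made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Trimmed-down fragment for illustration, not a full BLAST report.
my $xml = <<'XML';
<Hit>
  <Hit_hsps>
    <Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
    <Hsp><Hsp_num>2</Hsp_num><Hsp_evalue>3e-10</Hsp_evalue></Hsp>
  </Hit_hsps>
</Hit>
XML

my @hsps;
my $twig = XML::Twig->new(
    twig_handlers => {
        # called once for each complete <Hsp> element
        Hsp => sub {
            my ($t, $hsp) = @_;
            push @hsps, {
                num    => $hsp->first_child_text('Hsp_num'),
                evalue => $hsp->first_child_text('Hsp_evalue'),
            };
            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parse($xml);

printf "HSP %s: e-value %s\n", $_->{num}, $_->{evalue} for @hsps;
```

The handler sees a small self-contained tree for each HSP, so there is no start/end element state to track by hand. It is still a push parser, though, so the results pile up in @hsps unless the handler does something else with them.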
Rob
Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls). Jason thinks it's b/c of object inheritance and instantiation; I
> think it's probably more complicated than that (there are a ton of method
> calls which tend to slow things down quite a bit as well).
>
> I would say try using SearchIO, but instead of relying directly on object
> handler calls to create Hit/HSP objects using an object factory (which is
> where I think a majority of the speed is lost), build the data internally on
> the fly using start_element/end_element, then return hashes instead based on
> the element type triggered using end_element.
>
> As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> hashes at some point, possibly starting off with a different SearchIO plugin
> module. If you have other suggestions (XML parser of choice, ways to speed
> up parsing/retrieve data) we would be glad to hear them.
>
> Chris
>
>
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Robert Buels
>> Sent: Tuesday, July 18, 2006 7:06 PM
>> To: bioperl-l at bioperl.org
>> Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get
>> complicated
>>
>> Hi all,
>>
>> Here's a kind of abstract question about Bioperl and XML parsing:
>>
>> I'm thinking about writing a bioperl parser for genomethreader XML, and
>> I'm sort of mulling over the 'impedance mismatch' between the way
>> bioperl Bio::*IO::* modules work and the way all of the current XML
>> parsers work. Bioperl uses a 'pull' model, where every time you want a
>> new chunk of stuff, you call $io_object->next_thing. All the XML
>> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
>> 'push' model, where every time they parse a chunk, they call _your_
>> code, usually via a subroutine reference you've given to the XML parser
>> when you start it up.
>>
>> From what I can tell, current Bioperl IO modules that parse XML are
>> using push parsers to parse the whole document, holding stuff in memory,
>> then spoon-feeding it in chunks to the calling program when it calls
>> next_*(). This is fine until the input XML gets really big, in which
>> case you can quickly run out of memory.
>>
>> Does anybody have good ideas for nice, robust ways of writing a bioperl
>> IO module for really big input XML files? There don't seem to be any
>> perl pull parsers for XML. All I've dug up so far would be having the
>> XML push parser running in a different thread or process, pushing chunks
>> of data into a pipe or similar structure that blocks the progress of the
>> push parser until the pulling bioperl code wants the next piece of data,
>> but there are plenty of ugly issues with that, whether one were to use
>> perl threads for it (aaagh!) or fork and push some kind of intermediate
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu