[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Chris Fields cjfields at uiuc.edu
Wed Jul 19 18:45:55 UTC 2006


Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for
SearchIO::blastxml.  It previously used XML::Parser::PerlSAX, but that didn't
support SAX2-based parsing.  XML::Twig is also used quite a bit.
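
For reference, the SAX2 setup looks roughly like the sketch below (the
handler package and file name here are just placeholders):

  package MyHandler;                    # hypothetical SAX2 handler
  use base 'XML::SAX::Base';
  sub start_element { my ($self, $el) = @_; print "start: $el->{Name}\n" }
  sub end_element   { my ($self, $el) = @_; print "end:   $el->{Name}\n" }

  package main;
  use XML::SAX::ParserFactory;

  # ask the factory for the expat-backed SAX2 driver
  $XML::SAX::ParserPackage = 'XML::SAX::ExpatXS';
  my $parser = XML::SAX::ParserFactory->parser(Handler => MyHandler->new);
  $parser->parse_uri('report.blastxml');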

Jason added his thoughts about this to the wiki:

http://www.bioperl.org/wiki/XML_parsers

Personally, I use XML::Simple with EUtilities because the XML returned is
remarkably simple and normally fairly short.  The trick when parsing the data
is to dereference everything properly, since XML::Simple stores everything in
a deeply nested hash/array structure.  I plan on switching to
XML::SAX::ExpatXS or XML::Twig soon.
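
As a rough illustration of the dereferencing involved (an esearch call; the
exact element names depend on which utility you hit):

  use XML::Simple;
  use LWP::Simple qw(get);

  my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
          . '?db=pubmed&term=bioperl';
  my $ref = XMLin(get($url), ForceArray => ['Id']);

  # everything comes back as nested hash/array refs
  my $count = $ref->{Count};
  my @ids   = @{ $ref->{IdList}{Id} };   # dereference the list of UIDs
  print "$count hits, first ID: $ids[0]\n";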

Chris

> There are a lot of different XML processing strategies. Most fall into
> two categories: stream-based and tree-based.
> 
> With the stream-based strategy, the parser continuously alerts a program
> to patterns in the XML. The parser functions like a pipeline, taking XML
> markup on one end and pumping out processed nuggets of data to your
> program.
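>
> A bare-bones stream-based example with XML::Parser (the file name is just a
> placeholder):
>
>   use XML::Parser;
>
>   my $p = XML::Parser->new(Handlers => {
>       Start => sub { my ($expat, $elem, %attrs) = @_; print "<$elem>\n"  },
>       End   => sub { my ($expat, $elem) = @_;         print "</$elem>\n" },
>       Char  => sub { my ($expat, $text) = @_;  },  # character data
>   });
>   $p->parsefile('data.xml');   # handlers fire as each element is seen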
> 
> With the tree-based strategy, the parser keeps the data to itself until
> the very end, when it presents a complete model of the document to your
> program. The whole point of this strategy is that your program can pull
> out any data it needs, in any order.
> 
> Most of the time I use tree-based strategies because they place all of
> the data into a structure that lets me access any internal node
> using array/hash references. The simplest parser for this is XML::Simple
> using XML::Parser as the 'preferred parser' (which is built on top of
> XML::Parser::Expat, which is a wrapper around the expat library).
> 
> More advanced parsers (both stream and tree-based) are:
> 
> * XML::LibXML (a wrapper for libxml2's C library)
> * XML::Grove (takes a tree and changes it into an object hierarchy. Each
> node type is represented by a different class)
> * XML::PYX (for repackaging XML as a stream of easily recognizable and
> transmutable symbols)
> * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of
> objects)
> * XML::XPath (for writing expressions that pinpoint specific pieces of
> documents)
> 
> There are also some standards-based solutions like:
> 
> * XML::SAX (Simple API for XML) for event streams.
> * XML::DOM (Document Object Model) for tree processing.
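>
> To give a flavour of the tree/XPath style, a minimal XML::LibXML sketch
> (file and element names are only examples) might look like:
>
>   use XML::LibXML;
>
>   my $doc = XML::LibXML->new->parse_file('hits.xml');
>   # pull out exactly the nodes you care about, in any order
>   foreach my $node ($doc->findnodes('//Hit/Hit_def')) {
>       print $node->textContent, "\n";
>   }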
> 
> Your strategy of choice depends a lot on the type of XML files you want
> to parse. Understanding the structure of the files and deciding which
> data you want to extract from them is a fundamental step in choosing
> the appropriate method/parser.
> 
> Just my 2 cents :)
> 
> Regards,
> Mauricio.
> 
> Chris Fields wrote:
> > The Bio::SearchIO modules are supposed to work like a SAX parser, where
> > results are returned as the report is parsed b/c of the occurrence of
> > specific 'events' (start_element, end_element, and so on).  However, the
> > actual behaviour for each module changes depending on the report type and
> > the author's intention.
> >
> > There was a thread about a month ago on HMMPFAM report parsing where
> > there was some contention as to how to build hits(models)/HSPs(domains).
> > HMMPFAM output has one HSP per hit and is sorted along the sequence, so a
> > particular hit can appear more than once, depending on how many times it
> > matches along the sequence length.  So, to gather all the HSPs together
> > under one hit you would have to parse the entire report and build up a
> > Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> > everything.  Currently it just reports Hit/HSP pairs and it is up to the
> > user to build that tree.
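> >
> > Something along these lines (an untested sketch; the file name is made
> > up) is what a user has to do right now to regroup the domains:
> >
> >   use Bio::SearchIO;
> >
> >   my $in = Bio::SearchIO->new(-format => 'hmmer',
> >                               -file   => 'report.hmmpfam');
> >   my %hsps_by_model;
> >   while (my $result = $in->next_result) {
> >       while (my $hit = $result->next_hit) {
> >           while (my $hsp = $hit->next_hsp) {
> >               # collect all domains under the model name
> >               push @{ $hsps_by_model{ $hit->name } }, $hsp;
> >           }
> >       }
> >   }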
> >
> > In contrast, BLAST output should be capable of throwing hit/HSP clusters
> > on the fly based on the report output, but is quite slow (even the XML
> > output crawls).  Jason thinks it's b/c of object inheritance and
> > instantiation; I think it's probably more complicated than that (there
> > are a ton of method calls, which tend to slow things down quite a bit as
> > well).
> >
> > I would say try using SearchIO, but instead of relying directly on
> > object handler calls to create Hit/HSP objects via an object factory
> > (which is where I think a majority of the speed is lost), build the data
> > internally on the fly using start_element/end_element, then return
> > hashes based on which element type triggered end_element.
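> >
> > Roughly what I have in mind, as a sketch only (not the current handler
> > API):
> >
> >   # a hypothetical lightweight handler: accumulate into plain hashes
> >   sub start_element {
> >       my ($self, $data) = @_;
> >       push @{ $self->{_stack} }, { name => $data->{Name} };
> >   }
> >
> >   sub end_element {
> >       my ($self, $data) = @_;
> >       my $node = pop @{ $self->{_stack} };
> >       # queue finished hits/HSPs as plain hashes for next_hit()/next_hsp()
> >       push @{ $self->{_ready} }, $node if $data->{Name} =~ /^(hit|hsp)$/i;
> >   }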
> >
> > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> > hashes at some point, possibly starting off with a different SearchIO
> > plugin module.  If you have other suggestions (XML parser of choice, ways
> > to speed up parsing/data retrieval) we would be glad to hear them.
> >
> > Chris
> >
> >
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Robert Buels
> >> Sent: Tuesday, July 18, 2006 7:06 PM
> >> To: bioperl-l at bioperl.org
> >> Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get
> >> complicated
> >>
> >> Hi all,
> >>
> >> Here's a kind of abstract question about Bioperl and XML parsing:
> >>
> >> I'm thinking about writing a bioperl parser for genomethreader XML, and
> >> I'm sort of mulling over the 'impedance mismatch' between the way
> >> bioperl Bio::*IO::* modules work and the way all of the current XML
> >> parsers work.  Bioperl uses a 'pull' model, where every time you want a
> >> new chunk of stuff, you call $io_object->next_thing.  All the XML
> >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> >> 'push' model, where every time they parse a chunk, they call _your_
> >> code, usually via a subroutine reference you've given to the XML parser
> >> when you start it up.
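> >>
> >> For example, with XML::Twig the callbacks look something like this (the
> >> element name and helper are made up):
> >>
> >>   use XML::Twig;
> >>
> >>   my $twig = XML::Twig->new(
> >>       twig_handlers => {
> >>           # XML::Twig calls *this* code whenever it finishes an <alignment>
> >>           alignment => sub {
> >>               my ($t, $elt) = @_;
> >>               do_something_with($elt);   # hypothetical
> >>               $t->purge;                 # free what has been processed
> >>           },
> >>       },
> >>   );
> >>   $twig->parsefile('genomethreader_output.xml');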
> >>
> >> From what I can tell, current Bioperl IO modules that parse XML are
> >> using push parsers to parse the whole document, holding stuff in
> >> memory, then spoon-feeding it in chunks to the calling program when it
> >> calls next_*().  This is fine until the input XML gets really big, in
> >> which case you can quickly run out of memory.
> >>
> >> Does anybody have good ideas for nice, robust ways of writing a bioperl
> >> IO module for really big input XML files?  There don't seem to be any
> >> perl pull parsers for XML.  All I've dug up so far would be having the
> >> XML push parser running in a different thread or process, pushing
> >> chunks of data into a pipe or similar structure that blocks the progress
> >> of the push parser until the pulling bioperl code wants the next piece
> >> of data, but there are plenty of ugly issues with that, whether one were
> >> to use perl threads for it (aaagh!) or fork and push some kind of
> >> intermediate format through a pipe or socket between the two processes
> >> (eek!).
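> >>
> >> To make the fork-and-pipe idea concrete, here's a rough, untested sketch
> >> of what I mean (the element name is made up; the blocking pipe is what
> >> would throttle the push parser):
> >>
> >>   use IO::Handle;
> >>   use XML::Twig;
> >>
> >>   pipe(my $reader, my $writer) or die "pipe: $!";
> >>   $writer->autoflush(1);
> >>
> >>   my $pid = fork;
> >>   die "fork: $!" unless defined $pid;
> >>
> >>   if ($pid == 0) {                       # child: runs the push parser
> >>       close $reader;
> >>       XML::Twig->new(twig_handlers => {
> >>           alignment => sub {             # hypothetical element of interest
> >>               my ($t, $elt) = @_;
> >>               print {$writer} $elt->sprint, "\0";   # one record per chunk
> >>               $t->purge;
> >>           },
> >>       })->parsefile('big_gth_output.xml');
> >>       close $writer;
> >>       exit 0;
> >>   }
> >>
> >>   close $writer;                         # parent: pulls on demand
> >>   {
> >>       local $/ = "\0";
> >>       while (my $chunk = <$reader>) {    # child blocks once the pipe fills
> >>           chomp $chunk;
> >>           # re-parse $chunk (a self-contained XML fragment) into whatever
> >>           # object next_thing() is supposed to hand back
> >>       }
> >>   }
> >>   waitpid $pid, 0;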
> >>
> >> So, um, if you've read this far, do you have any ideas?
> >>
> >> Rob
> >>
> >> --
> >> Robert Buels
> >> SGN Bioinformatics Analyst
> >> 252A Emerson Hall, Cornell University
> >> Ithaca, NY  14853
> >> Tel: 503-889-8539
> >> rmb32 at cornell.edu
> >> http://www.sgn.cornell.edu
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




