[Bioperl-l] XML parsers

Robin Berjon robin.berjon at expway.fr
Mon Feb 3 11:07:58 EST 2003


Hilmar Lapp wrote:
> I know a couple people out there are using XML parsers for bioperl 
> modules (Heikki, ChrisM, I guess many more). There's a variety of parser 
> modules available from CPAN. In bioperl we currently have dependencies on
> 
>     XML::Parser
>     XML::Parser::PerlSAX
>     XML::Twig
> 
> (and XML::Writer, which I guess is not exactly for parsing ...).
> 
> - What is the XML parser that people generally prefer currently, and if 
> you don't mind to mention, why? (this doesn't have to be one of the above)

XML::Parser is more or less on the deprecation slope. It's still used in places 
as a low-level layer, but it is very strongly recommended *not* to use it for new 
development. Only small bugfixes and tests against new versions of expat are to be 
provided; it is likely that some of its somewhat larger bugs, such as those found 
in namespace support, will never be fixed. The reason for this is that its 
interface is dated: everything new uses PerlSAX 2.

XML::Parser::PerlSAX is just as deprecated because it's a partial implementation 
of PerlSAX 1. PerlSAX 1 isn't compatible with PerlSAX 2 (unless you insert a 
converter).

This may give the impression that XML folks like making tools only to 
deprecate them later, but that's not so. PerlSAX 2 is stable, and while a 2.1 
may happen at some point this year, a 3.0 is not currently on the map. Were it to 
happen, backwards compatibility would be maintained. PerlSAX 2 is the version 
that has been heavily advocated and advertised (in case you hadn't noticed from 
this post ;) and the one for which the greatest number of tools have been written.

Using PerlSAX 2 you have a wide array of tools:

- several XML parsers:
   XML::SAX::Expat, wrapping XML::Parser to make it behave as a SAX2 stream, 
very useful if you have XML::Parser installed and don't want to worry about 
installing extra stuff

   XML::SAX::PurePerl, a pure Perl parser, best for portability (but *slow*)

   XML::LibXML::SAX, a parser built on top of libxml2

- an XML parser factory, XML::SAX::ParserFactory. Using a simple interface, this 
will select a SAX parser from among those that you have installed. Very useful for 
writing portable code;

- SAX parsers for non-XML data sources such as CSV, Excel, Perl data structures, 
directory trees...anything you want;

- many SAX filters doing all sorts of manipulations (there are too many to list 
here, just look for XML::Filter::*);

- a pipeline manager (and much more), XML::SAX::Machines, which makes setting 
up a processing pipeline for SAX tools really simple and elegant;

- a SAX writer, XML::SAX::Writer, which is a framework for writing XML as well as 
non-XML output.

All of the above plug together with a high degree of interop; the combinations 
are endless :)
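
As a rough illustration of the factory approach described above, here is a 
minimal sketch. It assumes XML::SAX is installed with at least one parser 
registered; the ElementCounter package and the sample document are made up for 
the example and are not part of any module.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

# Hypothetical handler (not a CPAN module): counts start tags per element name.
package ElementCounter;
use base 'XML::SAX::Base';

sub start_document {
    my ($self, $doc) = @_;
    $self->{counts} = {};                  # reset the tally for each document
    $self->SUPER::start_document($doc);
}

sub start_element {
    my ($self, $el) = @_;
    $self->{counts}{ $el->{Name} }++;      # Name is the qualified element name
    $self->SUPER::start_element($el);      # pass the event on, if anyone listens
}

package main;

my $handler = ElementCounter->new;

# The factory picks whichever installed SAX2 parser it prefers.
my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<seqs><seq id="a"/><seq id="b"/></seqs>');

for my $name (sort keys %{ $handler->{counts} }) {
    print "$name: $handler->{counts}{$name}\n";
}
```

Because the handler only relies on SAX2 events, the same code should work 
unchanged whether the factory hands back XML::SAX::Expat, XML::LibXML::SAX or 
XML::SAX::PurePerl.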

Don't forget to check out Kip's articles in the xml.com "Perl & XML" column (to 
resume soon).

And that's just for the (SAX) parsers.

XML::Twig isn't a parser, it's a nice tree-based interface to XML data. It is 
very useful when you have a large document you wish to process one subtree at a 
time. In the same vein see XML::Filter::Dispatcher. Both are SAX-in, SAX-out.
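
To make the "one subtree at a time" idea concrete, here is a small sketch 
assuming XML::Twig is installed. The element names and the sample document are 
invented for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @ids;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <seq> subtree has been parsed.
        seq => sub {
            my ($t, $elt) = @_;
            push @ids, $elt->att('id');
            $t->purge;    # release the memory held by what was already handled
        },
    },
);

# Made-up sample input; a real document would typically be far larger.
$twig->parse(
    '<seqs><seq id="a"><len>10</len></seq><seq id="b"><len>20</len></seq></seqs>'
);

print join(',', @ids), "\n";
```

Purging inside the handler is what keeps memory use roughly proportional to one 
subtree rather than to the whole document.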

Another tree-based favourite is XML::LibXML. It exposes a DOM and requires the 
whole document to be in memory but it is very fast and is compatible with 
XML::LibXSLT, the interface to the best XSLT processor on earth.
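
For comparison, a minimal DOM-style sketch, assuming XML::LibXML is installed. 
The sample document is invented, and the XPath expression is just one way of 
picking out nodes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample input for illustration.
my $xml = '<seqs><seq id="a"><len>10</len></seq><seq id="b"><len>20</len></seq></seqs>';

# The whole document is parsed into an in-memory DOM tree.
my $doc = XML::LibXML->new->parse_string($xml);

# Once the tree is built, nodes can be selected with XPath.
my @ids = map { $_->getAttribute('id') } $doc->findnodes('//seq');

print join(',', @ids), "\n";
```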

I can go on for a while, but it might be better focussed if you ask specific 
questions ;)

> We have been experimenting here with XML::Simple and 
> XML::SAX::ParserFactory. The former provides a nice perl'ish view on the 
> DOM, but seems to be very slow. Has anyone played with those and made 
> experiences, positive and negative?

XML::Simple is great for simple things (it makes them easy) but it does tend to 
break at some degree of complexity. Please note that it is not a view on the 
DOM: it maps the document onto plain nested Perl hashes and arrays, whereas the 
DOM corresponds to one particular view on XML (different from other views such 
as the XPath view).
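
A short sketch of what XML::Simple returns, assuming the module is installed. 
The sample document is invented; ForceArray and KeyAttr are set explicitly here 
because the defaults can fold elements in ways that surprise people.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Made-up sample input for illustration.
my $xml = '<seqs><seq id="a"><len>10</len></seq><seq id="b"><len>20</len></seq></seqs>';

# ForceArray keeps <seq> as an array even if only one is present;
# KeyAttr => [] turns off folding of arrays into hashes keyed on id/name/key.
my $data = XMLin($xml, ForceArray => ['seq'], KeyAttr => []);

# $data is an ordinary hash reference, not a DOM node: there are no node
# objects, no parent/sibling links, just nested hashes and arrays.
print "$data->{seq}[1]{id} $data->{seq}[1]{len}\n";
```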

-- 
Robin Berjon <robin.berjon at expway.fr>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488


