[Bioperl-l] XML parsers
Robin Berjon
robin.berjon at expway.fr
Mon Feb 3 11:07:58 EST 2003
Hilmar Lapp wrote:
> I know a couple people out there are using XML parsers for bioperl
> modules (Heikki, ChrisM, I guess many more). There's a variety of parser
> modules available from CPAN. In bioperl we currently have dependencies on
>
> XML::Parser
> XML::Parser::PerlSAX
> XML::Twig
>
> (and XML::Writer, which I guess is not exactly for parsing ...).
>
> - What is the XML parser that people generally prefer currently, and if
> you don't mind to mention, why? (this doesn't have to be one of the above)
XML::Parser is more or less on the deprecation slope. It's still used in places
as a low-level layer, but it is very strongly recommended *not* to use it for new
development. Only small bugfixes and tests against new versions of expat are to
be provided; it is likely that some of its somewhat larger bugs, such as those
found in namespace support, will never be fixed. The reason for this is that its
interface is dated: everything new uses PerlSAX 2.
XML::Parser::PerlSAX is just as deprecated because it's a partial implementation
of PerlSAX 1. PerlSAX 1 isn't compatible with PerlSAX 2 (unless you insert a
converter).
This may give the impression that XML folks like making tools only to deprecate
them later, but that's not so. PerlSAX 2 is stable, and while a 2.1 may happen
at some point this year, a 3.0 is not currently on the map. Were it to happen,
backwards compatibility would be maintained. PerlSAX 2 is the version that was
heavily advocated and advertised (if you aren't noticing that from this post ;)
and the one for which the greatest number of tools have been written.
Using PerlSAX 2 you have a wide array of tools:
- several XML parsers:
    - XML::SAX::Expat, wrapping XML::Parser to make it behave as a SAX2 stream,
      very useful if you have XML::Parser installed and don't want to worry
      about installing extra stuff
    - XML::SAX::PurePerl, a pure Perl parser, best for portability (but *slow*)
    - XML::LibXML::SAX, a parser built on top of libxml2
- an XML parser factory, XML::SAX::ParserFactory. Using a simple interface, this
will select a SAX parser from among those that you have installed. Very useful
for writing portable code;
- SAX parsers for non-XML data sources such as CSV, Excel, Perl data structures,
directory trees...anything you want;
- many SAX filters doing all sorts of manipulations (there are too many to list
here, just look for XML::Filter::*);
- a pipeline manager (and much more), XML::SAX::Machines, which makes setting
up a processing pipeline for SAX tools really simple and elegant;
- a SAX writer, XML::SAX::Writer, which is a framework for writing XML as well
as non-XML outputs.
All of the above plug together with a high degree of interop; the combinations
are endless :)
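
To make the plugging-together concrete, here is a minimal sketch of a
parser -> filter -> writer chain. It assumes XML::SAX and XML::SAX::Writer are
installed; the filter class My::UpcaseFilter and the sample document are made
up for illustration, and which parser the factory hands back depends on what
is installed on your machine.

```perl
use strict;
use warnings;

# Hypothetical filter: upper-cases character data and lets every other
# SAX2 event pass through untouched (XML::SAX::Base does the forwarding).
package My::UpcaseFilter;
use base 'XML::SAX::Base';

sub characters {
    my ($self, $chars) = @_;
    # Forward a modified copy of the event instead of changing it in place.
    $self->SUPER::characters({ %$chars, Data => uc $chars->{Data} });
}

package main;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $output = '';

# Build the chain back to front: the writer is the final handler, the
# filter sends its events to the writer, the parser feeds the filter.
my $writer = XML::SAX::Writer->new(Output => \$output);
my $filter = My::UpcaseFilter->new(Handler => $writer);
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);

$parser->parse_string('<seq id="x1">acgt</seq>');
print $output, "\n";
```

Nothing in the filter or the writer knows which parser is driving them, so
swapping XML::SAX::PurePerl for a faster parser needs no code change.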
Don't forget to check out Kip's articles in the xml.com "Perl & XML" column (to
resume soon).
And that's just for the (SAX) parsers.
XML::Twig isn't a parser, it's a nice tree-based interface to XML data. It is
very useful when you have a large document you wish to process one subtree at a
time. In the same vein see XML::Filter::Dispatcher. Both are SAX-in, SAX-out.
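
A minimal sketch of that one-subtree-at-a-time style, assuming XML::Twig's
twig_handlers interface; the <entry> documents are invented for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<entries>
  <entry id="a1"><name>alpha</name></entry>
  <entry id="b2"><name>beta</name></entry>
</entries>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time an <entry> subtree has been completely parsed.
        entry => sub {
            my ($t, $entry) = @_;
            push @seen,
                $entry->att('id') . '=' . $entry->first_child_text('name');
            $t->purge;    # free what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @seen;
```

The purge call is what keeps memory use flat on a large document: each
subtree is thrown away as soon as its handler has finished with it.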
Another tree-based favourite is XML::LibXML. It exposes a DOM and requires the
whole document to be in memory but it is very fast and is compatible with
XML::LibXSLT, the interface to the best XSLT processor on earth.
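
For comparison, a small sketch of the DOM route with XML::LibXML, again on an
invented document. The whole tree is built first and then queried with XPath:

```perl
use strict;
use warnings;
use XML::LibXML;

# Parse the entire document into an in-memory DOM tree.
my $doc = XML::LibXML->new->parse_string(<<'XML');
<entries>
  <entry id="a1"><name>alpha</name></entry>
  <entry id="b2"><name>beta</name></entry>
</entries>
XML

# findnodes returns a list of nodes matching the XPath expression.
my @names = map { $_->textContent } $doc->findnodes('/entries/entry/name');

# findvalue returns the string value of the expression.
my $id = $doc->findvalue('/entries/entry[name="beta"]/@id');

print "names: @names\n";
print "id of beta: $id\n";
```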
I can go on for a while, but it might be better focussed if you ask specific
questions ;)
> We have been experimenting here with XML::Simple and
> XML::SAX::ParserFactory. The former provides a nice perl'ish view on the
> DOM, but seems to be very slow. Has anyone played with those and made
> experiences, positive and negative?
XML::Simple is great for simple things (it makes them easy) but it does tend to
break at some degree of complexity. Please note that it is not a view on the
DOM. The DOM corresponds to a certain view on XML (different from other views
such as the XPath view).
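
A short sketch of what XML::Simple gives you instead, and of one way it can
surprise you. The documents are made up, and KeyAttr is switched off here only
to keep the resulting structures easy to read:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

my $one = XMLin(
    '<entries><entry><name>alpha</name></entry></entries>',
    KeyAttr => [], ForceArray => 0,
);
my $two = XMLin(
    '<entries><entry><name>alpha</name></entry>'
  . '<entry><name>beta</name></entry></entries>',
    KeyAttr => [], ForceArray => 0,
);

# The result is plain nested Perl data, not a tree of node objects,
# and its shape depends on how many <entry> elements were present.
print ref($one->{entry}), "\n";    # a single entry becomes a hash
print ref($two->{entry}), "\n";    # repeated entries become an array
```

Passing ForceArray => ['entry'] makes the shape predictable, but having to
know that is the kind of thing that starts to bite as documents get more
complex.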
--
Robin Berjon <robin.berjon at expway.fr>
Research Engineer, Expway http://expway.fr/
7FC0 6F5F D864 EFB8 08CE 8E74 58E6 D5DB 4889 2488