[Bioperl-l] XML parsers

Mon Feb 3 09:54:03 EST 2003

Thanks for your input. Very comprehensive, very helpful. You may have 
noticed that my working with XML has been limited ... -hilmar

On Monday, February 3, 2003, at 02:07  AM, Robin Berjon wrote:

> Hilmar Lapp wrote:
>> I know a couple people out there are using XML parsers for bioperl 
>> modules (Heikki, ChrisM, I guess many more). There's a variety of 
>> parser modules available from CPAN. In bioperl we currently have 
>> dependencies on
>>     XML::Parser
>>     XML::Parser::PerlSAX
>>     XML::Twig
>> (and XML::Writer, which I guess is not exactly for parsing ...).
>> - What is the XML parser that people generally prefer currently, and 
>> if you don't mind mentioning it, why? (This doesn't have to be one of 
>> the above.)
>
> XML::Parser is more or less on the deprecation slope. It's still used 
> in places as a low-level thing, but it is very strongly recommended 
> *not* to use it for new development. Only small bugfixes and tests on 
> new versions of expat are to be provided; it is likely that some of 
> its somewhat larger bugs, such as those found in namespace support, 
> will never be fixed. The reason for this is that its interface is 
> dated; everything new uses PerlSAX 2.
>
> XML::Parser::PerlSAX is just as deprecated because it's a partial 
> implementation of PerlSAX 1. PerlSAX 1 isn't compatible with PerlSAX 2 
> (unless you insert a converter).
>
> This may give the impression that XML folks like making tools to 
> better deprecate them later, but that's not so. PerlSAX 2 is stable, 
> and while a 2.1 may happen at some point this year a 3.0 is not 
> currently on the map. Were it to happen, backwards compatibility would 
> be maintained. PerlSAX 2 is the version that was heavily advocated and 
> advertised (if you aren't noticing that from this post ;) and the one 
> for which the greatest number of tools were written.
>
> Using PerlSAX 2 you have a wide array of tools:
>
> - several XML parsers:
>   XML::SAX::Expat, wrapping XML::Parser to make it behave as a SAX2 
> stream, very useful if you have XML::Parser installed and don't want 
> to worry about installing extra stuff
>
>   XML::SAX::PurePerl, a pure Perl parser, best for portability (but 
> *slow*)
>
>   XML::LibXML::SAX, a parser built on top of libxml2
>
> - an XML parser factory, XML::SAX::ParserFactory. Using a simple 
> interface, this will select a SAX parser amongst those that you have 
> installed. Very useful for writing portable code;
>
> - SAX parsers for non-XML data sources such as CSV, Excel, Perl data 
> structures, directory trees...anything you want;
>
> - many SAX filters doing all sorts of manipulations (there are too 
> many to list here, just look for XML::Filter::*);
>
> - a pipeline manager (and much more), XML::SAX::Machines which will 
> make setting up a processing pipeline for SAX tools really simple and 
> elegant;
>
> - a SAX Writer, XML::SAX::Writer which is a framework to write to XML 
> as well as non-XML outputs.
>
> All of the above plug together with a high degree of interop; the 
> combinations are endless :)
>
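
To make the parser -> filter -> writer idea concrete, here is a minimal 
sketch. It is an assumption-laden analogue, not the Perl modules named 
above: it uses Python's standard-library xml.sax, which follows the same 
SAX2 event model. make_parser() plays the role of the parser factory, 
XMLFilterBase the role of an XML::Filter::* module, and XMLGenerator the 
role of the writer. The UppercaseNames filter and run_pipeline() are 
hypothetical names made up for this illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class UppercaseNames(XMLFilterBase):
    """Hypothetical filter: upper-cases element names, passes all else through."""

    def startElement(self, name, attrs):
        super().startElement(name.upper(), attrs)

    def endElement(self, name):
        super().endElement(name.upper())


def run_pipeline(xml_text):
    """Parse xml_text, push the events through the filter, serialize the result."""
    out = io.StringIO()
    parser = xml.sax.make_parser()        # factory: returns an installed SAX driver
    pipeline = UppercaseNames(parser)     # the filter wraps the parser
    pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))  # the writer
    pipeline.parse(io.StringIO(xml_text))
    return out.getvalue()


result = run_pipeline("<a><b>hi</b></a>")
print(result)
```

Because every stage only sees SAX events, swapping the parser or adding 
a second filter does not require touching the other stages.
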
> Don't forget to check out Kip's articles in the xml.com "Perl & XML" 
> column (to resume soon).
>
> And that's just for the (SAX) parsers.
>
> XML::Twig isn't a parser; it's a nice tree-based interface to XML 
> data. It is very useful when you have a large document you wish to 
> process one subtree at a time. In the same vein see 
> XML::Filter::Dispatcher. Both are SAX-in, SAX-out.
>
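
The "one subtree at a time" style can be sketched as follows. This is 
not XML::Twig itself but a rough analogue using iterparse() from 
Python's standard-library ElementTree; the iter_records() helper and the 
"entry" element name are assumptions made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield each completed <tag> subtree, then discard its contents."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs once the caller asks for the next record: drop the
            # subtree's children, text and attributes to keep memory low.
            elem.clear()


doc = io.BytesIO(b"<db><entry id='1'>ACGT</entry><entry id='2'>TTGA</entry></db>")
records = [(e.get("id"), e.text) for e in iter_records(doc, "entry")]
print(records)
```

Note that the emptied elements stay attached to the root in this sketch, 
so for very large inputs one would also want to detach them from their 
parent.
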
> Another tree-based favourite is XML::LibXML. It exposes a DOM and 
> requires the whole document to be in memory but it is very fast and is 
> compatible with XML::LibXSLT, the interface to the best XSLT processor 
> on earth.
>
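
For contrast with the streaming style, a DOM-style sketch looks like 
this. It uses Python's standard-library xml.dom.minidom purely as a 
stand-in (it is not libxml2 and makes no claim about XML::LibXML's 
speed); the gene_names() helper and the <gene name="..."> layout are 
assumptions for illustration.

```python
from xml.dom.minidom import parseString


def gene_names(xml_text):
    """Return the name attribute of every <gene> element in the document."""
    dom = parseString(xml_text)  # the whole document is built in memory here
    return [node.getAttribute("name") for node in dom.getElementsByTagName("gene")]


names = gene_names('<genes><gene name="BRCA1"/><gene name="TP53"/></genes>')
print(names)
```

The convenience is random access to any node after parsing; the cost is 
that memory use grows with document size.
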
> I can go on for a while, but it might be better focussed if you ask 
> specific questions ;)
>
>> We have been experimenting here with XML::Simple and 
>> XML::SAX::ParserFactory. The former provides a nice perl'ish view on 
>> the DOM, but seems to be very slow. Has anyone played with those and 
>> had experiences with them, positive or negative?
>
> XML::Simple is great for simple things (it makes them easy) but it 
> does tend to break at some degree of complexity. Please note that it 
> is not a view on the DOM. The DOM corresponds to a certain view on XML 
> (different from other views such as the XPath view).
>
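
A small sketch of why the "XML as nested hashes" approach is easy for 
simple documents and breaks on more complex ones. This is not 
XML::Simple's actual algorithm; simplify() is a hypothetical helper 
written with Python's standard-library ElementTree that only mimics the 
general idea of collapsing elements into dicts, lists and strings.

```python
import xml.etree.ElementTree as ET


def simplify(elem):
    """Collapse an element into plain dicts, lists and strings."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text or ""
    result = dict(elem.attrib)
    for child in children:
        # Repeated child elements are collected into a list under their tag.
        result.setdefault(child.tag, []).append(simplify(child))
    text = (elem.text or "").strip()
    if text:
        result["content"] = text
    return result


# Record-like XML maps cleanly onto a nested structure.
record = simplify(ET.fromstring('<seq id="s1"><name>foo</name><name>bar</name></seq>'))

# Mixed content does not: the text after <b> and the ordering are lost.
mixed = simplify(ET.fromstring("<p>see <b>this</b> now</p>"))
print(record, mixed)
```

The second case is the kind of complexity where a DOM or SAX view keeps 
information that the simplified structure silently drops.
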
> -- 
> Robin Berjon <robin.berjon at expway.fr>
> Research Engineer, Expway        http://expway.fr/
> 7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------