[Biopython-dev] XML parsing library for new modules
p.j.a.cock at googlemail.com
Mon May 4 15:45:12 UTC 2009
>>> My lean[ing] is towards ElementTree for reasons of code
>>> clarity. SAX parsers require a lot of boilerplate style code.
>>> They also can be tricky with nested elements; I always
>>> find myself using a lot of "if in_tag; else if in_tag" style
>>> code. ElementTree eliminates a lot of these issues
>>> which should result in easier to maintain code.
> This is partially true. SAX parsers can be complicated, but
> with some dedication reasonably clear code is also possible.
> The SAX parser in Bio.Entrez is not all that bad, and it can
> handle all kinds of different XML pages as long as a DTD
> is available. The prime motivation for ElementTree is that
> it's mutable; I don't know if that is really needed in this case.
Eric will have to answer that regarding PhyloXML, but if the
aim is to turn it into one of our existing tree objects, then
having the XML structure mutable is irrelevant.
> Another thing to consider is what to do with the result
> returned by ElementTree. Whereas it will contain all the
> information in the XML file, it may not represent it in a
> user-friendly way. You may want to take the output from
> ElementTree and store it in a more biopython-like object.
> Also keep in mind memory usage: ElementTree will keep
> the complete XML file in memory, whereas the SAX
> parser gives you more flexibility here (see below).
Something for Eric to consider.
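To make the "store it in a more biopython-like object" idea concrete, here is a rough sketch of walking an ElementTree result and copying it into a plain Python object. The Clade class, its attribute names and the toy XML are all made up for illustration; this is not the PhyloXML schema (which uses a namespace, among other things) and not an existing Biopython class.

```python
from xml.etree import ElementTree


class Clade(object):
    """Hypothetical minimal tree node, purely for illustration."""

    def __init__(self, name=None, branch_length=None, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []


def clade_from_element(elem):
    """Recursively copy a <clade> element into a Clade object."""
    length = elem.findtext("branch_length")  # None if the child is absent
    return Clade(
        name=elem.findtext("name"),
        branch_length=float(length) if length is not None else None,
        children=[clade_from_element(child) for child in elem.findall("clade")],
    )


# Toy input, not real PhyloXML.
toy = (
    "<clade><name>root</name>"
    "<clade><name>A</name><branch_length>0.1</branch_length></clade>"
    "<clade><name>B</name><branch_length>0.2</branch_length></clade>"
    "</clade>"
)
root = clade_from_element(ElementTree.fromstring(toy))
```

Once the copy is made the ElementTree structure can be thrown away, so whether it is mutable or not stops mattering.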
> That said, I don't have any fundamental objections
> against using ElementTree.
>> We have been trying to avoid external library dependencies
>> where possible (moving away from Martel for parsing has
>> really helped here). Given ElementTree and cElementTree
>> are included with Python 2.5+, this is only an issue for
>> Biopython running on Python 2.4.
> I think it's OK to require Python 2.5 or later for Biopython.
At this stage I disagree: Python 2.4 is still widely
used on production servers running stable distributions.
Also we'd have to give a couple of releases' notice about
dropping Python 2.4 support. In any case, if we want to
use ElementTree with Python 2.4 this is possible.
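For example, something along these lines should cope with both cases. This is an untested sketch: it assumes the standalone elementtree / cElementTree packages have been installed on Python 2.4, and the order of the fallbacks is just my guess at a sensible preference (C version first).

```python
try:
    # C implementation bundled with the standard library (Python 2.5+)
    from xml.etree import cElementTree as ElementTree
except ImportError:
    try:
        # Pure Python implementation bundled with the standard library
        from xml.etree import ElementTree
    except ImportError:
        try:
            # Standalone C implementation, installed separately on Python 2.4
            import cElementTree as ElementTree
        except ImportError:
            # Standalone pure Python implementation, installed separately
            from elementtree import ElementTree

# Whichever import succeeded, the rest of the code uses the same name.
tree_root = ElementTree.fromstring("<a><b/></a>")
```

The rest of a module would then only ever refer to ElementTree, so the Python 2.4 special case stays in one place.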
>> P.S. I wonder if our BLAST XML parser would get a big speed
>> boost if we switched it to ElementTree instead of xml.sax?
> I doubt it, since the SAX parser is pretty straightforward --
> the hard part is to go through the DTD and find out how to
> interpret each element in the XML (this is not
> time-consuming though). The key point though is memory
> usage. With the SAX parser, you can parse the XML file in
> chunks, and use an iterator to return individual Blast records
> -- you don't need to keep the full XML file in memory. The
> Blast parser NCBIXML.parse does exactly that. With
> ElementTree, as far as I understand you read in the full
> XML file and keep it in memory.
Keeping a full BLAST XML file in memory would be a bad idea,
and would spoil the memory savings of the iterator approach
to parsing it. So ElementTree isn't suitable for everything ;)
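That said, ElementTree does also offer iterparse, which builds the tree incrementally. Below is a rough sketch of how that might be combined with an iterator so that each record is discarded once used. The function name iter_records is made up, the input is a toy document that only borrows a few BLAST-style tag names, and I have not measured the memory use on real BLAST output.

```python
import io
from xml.etree import ElementTree


def iter_records(handle, tag="Iteration"):
    """Yield each completed <tag> element, then empty it to free memory."""
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next record: drop this
            # record's children and text. An empty stub element remains
            # attached to the parent.
            elem.clear()


# Toy input for illustration only, far smaller than a real BLAST XML file.
toy = io.BytesIO(
    b"<BlastOutput><BlastOutput_iterations>"
    b"<Iteration><Iteration_query-def>q1</Iteration_query-def></Iteration>"
    b"<Iteration><Iteration_query-def>q2</Iteration_query-def></Iteration>"
    b"</BlastOutput_iterations></BlastOutput>"
)
names = [rec.findtext("Iteration_query-def") for rec in iter_records(toy)]
```

The caller has to finish with each record before asking for the next one, since the element is emptied at that point. The empty stubs still accumulate, so this is not quite as lean as the SAX approach.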