[Biopython-dev] XML parsing library for new modules

Mon May 4 15:25:17 UTC 2009

--- On Mon, 5/4/09, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > My lean is towards ElementTree for reasons of code clarity. SAX
> > parsers require a lot of boilerplate style code. They also can be
> > tricky with nested elements; I always find myself using a lot of "if
> > in_tag; else if in_tag" style code. ElementTree eliminates a lot of
> > these issues which should result in easier to maintain code.

This is partially true. SAX parsers can be complicated, but with some dedication reasonably clear code is also possible. The SAX parser in Bio.Entrez is not all that bad, and it can handle all kinds of different XML pages as long as a DTD is available. The prime motivation for ElementTree is that it's mutable; I don't know if that is really needed in this case. Another thing to consider is what to do with the result returned by ElementTree. While it will contain all the information in the XML file, it may not represent it in a user-friendly way. You may want to take the output from ElementTree and store it in a more Biopython-like object. Also keep in mind memory usage: ElementTree will keep the complete XML file in memory, whereas the SAX parser gives you more flexibility here (see below).
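To make the trade-off concrete, here is a minimal sketch of the same task done both ways: collecting the text of every <name> that sits inside a <hit>, while ignoring a top-level <name>. The XML layout, the tag names, and the helper names (HitNameHandler, hit_names_sax, hit_names_etree) are made up for illustration and are not taken from Bio.Entrez or any Biopython parser. It assumes Python 3 and uses only the standard library.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical document: one top-level <name> plus a <name> inside each <hit>.
SAMPLE = """<results>
  <name>top level</name>
  <hit><name>first</name></hit>
  <hit><name>second</name></hit>
</results>"""


class HitNameHandler(xml.sax.ContentHandler):
    """SAX version: state flags track where we are in the document."""

    def __init__(self):
        super().__init__()
        self.in_hit = False
        self.in_name = False
        self.buffer = []
        self.names = []

    def startElement(self, name, attrs):
        if name == "hit":
            self.in_hit = True
        elif name == "name" and self.in_hit:
            self.in_name = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_name:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "hit":
            self.in_hit = False
        elif name == "name" and self.in_name:
            self.names.append("".join(self.buffer))
            self.in_name = False


def hit_names_sax(text):
    handler = HitNameHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def hit_names_etree(text):
    """ElementTree version: the whole tree is built first, then queried."""
    root = ET.fromstring(text)
    return [hit.findtext("name") for hit in root.findall("hit")]
```

Both functions return the same list of names for SAMPLE. The ElementTree version is shorter, but it only runs once the whole tree has been built, and its result is still a plain list that one might want to wrap in a more Biopython-like object.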

That said, I don't have any fundamental objections against using ElementTree.

>
> We have been trying to avoid external library dependencies where
> possible (moving away from Martel for parsing has really helped here).
> Given ElementTree and cElementTree are included with Python 2.5+,
> this is only an issue for Biopython running on Python 2.4.

I think it's OK to require Python 2.5 or later for Biopython.

> P.S. I wonder if our BLAST XML parser would get a big speed
> boost if we switched it to ElementTree instead of xml.sax?

I doubt it, since the SAX parser is pretty straightforward -- the hard part is to go through the DTD and find out how to interpret each element in the XML (this is not time-consuming though). The key point, though, is memory usage. With the SAX parser, you can parse the XML file in chunks, and use an iterator to return individual Blast records -- you don't need to keep the full XML file in memory. The Blast parser NCBIXML.parse does exactly that. With ElementTree, as far as I understand, you read in the full XML file and keep it in memory.
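The chunk-plus-iterator pattern can be sketched with the standard library alone. This is not the NCBIXML.parse implementation, only an illustration of the idea under stated assumptions: the record layout, the tag names, and the names RecordHandler and iter_records are hypothetical, records are assumed to be flat (no nested children), and the code assumes Python 3.

```python
import io
import xml.sax

# Hypothetical input: a flat list of <record> elements.
SAMPLE = (
    b"<results>"
    b"<record><id>a1</id><score>42</score></record>"
    b"<record><id>b2</id><score>7</score></record>"
    b"</results>"
)


class RecordHandler(xml.sax.ContentHandler):
    """Build one dict per record and queue it once its end tag is seen."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.finished = []   # completed records not yet handed out
        self.current = None  # record being built, or None outside a record
        self.text = []       # character data of the current element

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.current = {}
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if self.current is None:
            return
        if name == self.record_tag:
            self.finished.append(self.current)
            self.current = None
        else:
            self.current[name] = "".join(self.text).strip()
        self.text = []


def iter_records(handle, record_tag="record", block_size=4096):
    """Yield records one at a time, reading the handle block by block."""
    handler = RecordHandler(record_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        block = handle.read(block_size)
        if block:
            parser.feed(block)   # parse only this chunk
        else:
            parser.close()       # signal end of input
        while handler.finished:
            yield handler.finished.pop(0)
        if not block:
            break


if __name__ == "__main__":
    # A tiny block size forces records to span several chunks.
    for record in iter_records(io.BytesIO(SAMPLE), block_size=8):
        print(record)
```

Only the current block and the records completed within it are held at any time, so memory use depends on the size of a record rather than the size of the file.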

--Michiel.