[Biopython-dev] Bio.Entrez XML parsing

Sun Mar 30 14:49:41 UTC 2008

> Eric, could you attach your taxonomy XML code to this bug?
> We'd probably want to start by adding taxonomy XML parsing
> to Bio.Entrez (which I assume you are using to fetch the XML data).

I've done some thinking about XML parsers for Bio.Entrez.

I propose adding a read() function to Bio.Entrez that returns a record suited to the type of XML file being read (as determined by the corresponding DTD file).

The various XML types can be very different from each other, so I think the actual parsing should be done by specialized submodules of Bio.Entrez: for example, one Bio.Entrez.EInfo, one Bio.Entrez.ESummary, and so on. For Bio.Entrez.EFetch, there seem to be many different XML formats, so we'd probably have a number of submodules for it (one of them for the taxonomy XML).

The first tag received by the read() function in Bio.Entrez tells it which type of XML it is receiving (have a look at the XML files shown in chapter 6 of the tutorial for some examples), so read() can then decide which of the submodules of Bio.Entrez should be used for the actual parsing. Beyond that dispatch, the read() function in Bio.Entrez does very little; the actual work is done by the submodules.

If the read() function encounters an XML type for which no parser is yet available, it can raise a NotImplementedError exception.
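
To make the dispatch idea concrete, here is a rough sketch, not a finished design. The two parser functions stand in for the proposed submodules and are hypothetical, as is the shape of the records they return. The root tag names (eInfoResult, eSummaryResult) and child tags are what I'd expect from EInfo and ESummary output and should be checked against the DTDs.

```python
import io
import xml.etree.ElementTree as ET


def _read_einfo(root):
    # Hypothetical stand-in for a parser living in Bio.Entrez.EInfo.
    return {"type": "EInfo", "databases": [e.text for e in root.iter("DbName")]}


def _read_esummary(root):
    # Hypothetical stand-in for a parser living in Bio.Entrez.ESummary.
    return {"type": "ESummary", "ids": [e.text for e in root.iter("Id")]}


# Maps the first (root) tag of the XML to the parser that handles it.
# The tag names are assumptions; a real table would be built from the DTDs.
_PARSERS = {
    "eInfoResult": _read_einfo,
    "eSummaryResult": _read_esummary,
}


def read(handle):
    """Look at the first tag, then hand the document to the matching parser."""
    events = ET.iterparse(handle, events=("start",))
    _, root = next(events)  # the first start event carries the root tag
    parser = _PARSERS.get(root.tag)
    if parser is None:
        raise NotImplementedError("no parser available for <%s>" % root.tag)
    for _ in events:
        pass  # consume the remaining events so the tree under root is complete
    return parser(root)


if __name__ == "__main__":
    xml = b"<eInfoResult><DbList><DbName>pubmed</DbName></DbList></eInfoResult>"
    print(read(io.BytesIO(xml)))
```

With this layout, supporting a new XML type (say, the taxonomy XML) would mean writing one more parser and adding one entry to the table; read() itself would not change.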

Comments, anybody?

--Michiel
