[Biopython] NCBI e-utils parser upgrade

Michiel de Hoon mjldehoon at yahoo.com
Fri Nov 21 01:20:14 UTC 2014


Hi Ivan,

I am the original author of Bio.Entrez.
The parser in Bio.Entrez consists of two parts: the XML parser and the DTD parser.
The DTD parser determines how the elements in the XML file should be represented in Python.
To support schemas, all that is needed is a parser for the schema; the XML parser stays unchanged.
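
To illustrate the DTD side of this split, here is a minimal standalone sketch that uses Python's xml.parsers.expat directly. It is not the Bio.Entrez implementation: the function name, the tiny inline DTD, and the "string"/"container" labels are assumptions made up for the example.

```python
from xml.parsers import expat

# Hypothetical document with an inline DTD, made up for this sketch.
XML = """<?xml version="1.0"?>
<!DOCTYPE Result [
<!ELEMENT Result (Id*)>
<!ELEMENT Id (#PCDATA)>
]>
<Result><Id>231025</Id></Result>"""


def classify_elements(xml_text):
    """Map each element declared in the DTD to a rough Python representation."""
    kinds = {}

    def element_decl(name, model):
        # model is (type, quantifier, name, children) as reported by expat.
        content_type, children = model[0], model[3]
        if content_type == expat.model.XML_CTYPE_MIXED and not children:
            kinds[name] = "string"      # text-only element, e.g. (#PCDATA)
        else:
            kinds[name] = "container"   # element that holds child elements

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.Parse(xml_text, True)
    return kinds


kinds = classify_elements(XML)
```

A schema parser would have to produce the same kind of per-element information from the schema file instead of from element declarations.
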
In Bio/Entrez/Parser.py, you will find the method startNamespaceDeclHandler;
currently it just raises a NotImplementedError.
If you try the Bio.Entrez parser on your XML file, you will see that this error gets raised.
So all you would have to do is implement startNamespaceDeclHandler;
it should parallel externalEntityRefHandler, which parses DTD files, although the bulk of the DTD work is done in elementDecl.
Please let me know if you run into any problems.
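
As a rough starting point, the expat hooks involved can be sketched as follows. This is a standalone illustration under stated assumptions, not code from Bio/Entrez/Parser.py: the function name, the sample document, and the example.org schema URL are hypothetical. It only shows that expat reports namespace declarations when the parser is created with a namespace separator, and how a schema location could be picked up from the standard xsi attributes.

```python
from xml.parsers import expat

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical schema-based document; element names and URL are made up.
XML = (
    '<Report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="http://example.org/report.xsd">'
    "<Id>231025</Id></Report>"
)


def find_schema_hints(xml_text):
    """Collect namespace declarations and schema locations from a document."""
    namespaces = {}
    schema_locations = []

    def start_namespace_decl(prefix, uri):
        # Called by expat for every xmlns declaration it encounters.
        namespaces[prefix] = uri

    def start_element(name, attrs):
        # With namespace processing on, attribute names are "uri localname".
        for local in ("schemaLocation", "noNamespaceSchemaLocation"):
            value = attrs.get(XSI + " " + local)
            if value is not None:
                schema_locations.append(value)

    # Namespace handlers only fire when a namespace separator is given.
    parser = expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = start_namespace_decl
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return namespaces, schema_locations


namespaces, schema_locations = find_schema_hints(XML)
```

A real implementation would then need to fetch and parse the schema found this way, much as the DTD is handled today.
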

Best,
-Michiel.




--------------------------------------------
On Fri, 11/21/14, Ivan Erill <ivan.erill at gmail.com> wrote:

 Subject: [Biopython] NCBI e-utils parser upgrade
 To: biopython at mailman.open-bio.org
 Date: Friday, November 21, 2014, 2:42 AM
 
 Hi all,
 As part of my work, I need to deal with the new WP protein records at
 NCBI and, specifically, with the information on their coding sequences.
 This information is returned by E-utils through an integrated protein
 report type of view:
 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=231025&rettype=ipg
 
 which does not use a DTD for the XML, but rather a schema. Although
 there has been no formal announcement, I've been talking to NCBI people
 and they tell me that they will progressively be moving to schemas
 (which allow a more fine-grained validation specification).
 Specifically, all new XML exports from NCBI will be using schemas. I
 don't believe that existing DTDs are going to be replaced by schemas
 for now.
 My original thought was to branch an update for the current XML parser
 in Biopython, but it looks like using schemas would be a major overhaul
 of the existing codebase and it might make more sense to develop a
 parallel parser, so I first wanted to check which approach you guys
 would prefer code-wise.
 Regards,
 Ivan
 


