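<div dir="ltr">For your last issue, if you don't mind disentangling the data after you've pulled it out of the XML document, you can use this pattern to convert the document into a nested collection of dictionaries that mirrors its structure:<br><br><div><div><font face="monospace, monospace">from collections import defaultdict</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">def recursive_dict(element):</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;# Start from the element's attributes, then fold in its children.</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;data_dict = dict(element.attrib)</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;children = map(recursive_dict, element)</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;children_nodes = defaultdict(list)</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;clean_nodes = {}</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;for node, data in children:</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;children_nodes[node].append(data)</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;# A tag seen once maps to its value; a repeated tag maps to a list.</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;for node, data_list in children_nodes.items():</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;clean_nodes[node] = data_list[0] if len(data_list) == 1 else data_list</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;if clean_nodes:</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data_dict.update(clean_nodes)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;if element.text is not None and not element.text.isspace():</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data_dict['text'] = element.text</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;# An element that holds nothing but text collapses to that string.</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;if len(data_dict) == 1 and 'text' in data_dict:</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data_dict = data_dict['text']</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;tag = element.tag</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;return tag, data_dict</font></div></div><div><font face="monospace, monospace"><br></font></div><div><font face="arial, helvetica, sans-serif">Feed it the root of the ElementTree you want to 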
parse, and it will return the root tag along with the complete tree in dictionary form. </font></div><div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">From that dictionary you can infer an ad-hoc schema, which will most likely depend on the class of organism you're looking at.</font></div><div><font face="arial, helvetica, sans-serif"><br></font></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, May 17, 2015 at 4:24 PM, Anna Simpson <span dir="ltr">&lt;<a href="mailto:acsimpson@gmail.com" target="_blank">acsimpson@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Hi all,<br></div>I've been trying to parse xml files from an efetch query to the bioproject database, and kept getting an error message about no dtd (and validation=False gets me no data at all) when using Entrez.read or Entrez.parse. I found a post on this mailing list from 2013, where a gentleman had the same problem - he emailed NCBI and was told the following: <br><br>
"Yes this is the "normal" but it is an oversight as a dtd was never created
for this database. I will have to open a ticket to the developers to create
this and have it included in the XML and on the DTD web page."<br><br>I've emailed NCBI about this again but I'm guessing there still isn't one (and I can't find it in the DTD index page). But my various googlings have led me to find that there is a schema for bioproject, and that perhaps, somehow, it could be used to parse these xml files. How might I go about doing that?<br><br>I've been trying to use xml parsers like element tree and Beautiful Soup but keep running into walls (how to stick an entrez handle into a parser, how to get it to give me deeply nested information when the nesting is different for each xml document I get and I'm running this through a loop) so it would be great if I could ...stop doing that.<br><br></div><div>Thanks,<br></div><div>Anna<br></div><div>University of Washington, Seattle<br></div></div>
<br>_______________________________________________<br>
Biopython mailing list - <a href="mailto:Biopython@mailman.open-bio.org">Biopython@mailman.open-bio.org</a><br>
<a href="http://mailman.open-bio.org/mailman/listinfo/biopython" target="_blank">http://mailman.open-bio.org/mailman/listinfo/biopython</a><br></blockquote></div><br></div>
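<div dir="ltr"><font face="arial, helvetica, sans-serif">As a rough usage sketch for the recursive_dict pattern at the top of this message: the snippet below repeats the function in condensed form so it runs on its own, then applies it to a small tree built by hand. The tag and attribute names (Project, Title, Organism, Link, accession, taxID) are made up for illustration and are not taken from the real BioProject format. For real data, I'd expect ET.parse(handle).getroot() to give you the root, assuming the efetch handle behaves like an open file.</font></div>
<pre>
```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def recursive_dict(element):
    """Return (tag, data) for an element, recursing into its children."""
    data_dict = dict(element.attrib)
    children_nodes = defaultdict(list)
    for node, data in map(recursive_dict, element):
        children_nodes[node].append(data)
    for node, data_list in children_nodes.items():
        # A tag seen once maps to its value; a repeated tag maps to a list.
        data_dict[node] = data_list[0] if len(data_list) == 1 else data_list
    if element.text is not None and not element.text.isspace():
        data_dict['text'] = element.text
    # An element that holds nothing but text collapses to that string.
    if len(data_dict) == 1 and 'text' in data_dict:
        data_dict = data_dict['text']
    return element.tag, data_dict


# Hypothetical record built by hand; every name here is invented.
# With a real file-like handle you would instead start from:
#     root = ET.parse(handle).getroot()
root = ET.Element('Project', {'accession': 'PRJ0'})
ET.SubElement(root, 'Title').text = 'Example project'
ET.SubElement(root, 'Organism', {'taxID': '1'}).text = 'Example species'
ET.SubElement(root, 'Link').text = 'first'
ET.SubElement(root, 'Link').text = 'second'

tag, data = recursive_dict(root)
print(tag)
print(data)
```
</pre>
<div dir="ltr"><font face="arial, helvetica, sans-serif">Note that the same tag can come back as a plain string, a dictionary, or a list depending on the record, so code that walks the result in a loop should check the type before indexing into it.</font></div>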