[Biopython] Entrez.parse error

Wed Dec 21 08:27:31 UTC 2016

Entrez.parse was written for a reason: to parse complex XML data so that it is easy to extract citation data from it. Entrez.read does indeed work, but its output is such a complex data structure that parsing it is a non-trivial exercise.

Entrez.parse worked for a very long time, but no longer does.  Try the following example from the Biopython documentation <http://biopython.org/DIST/docs/api/Bio.Entrez-module.html>:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
records = Entrez.parse(handle)
for record in records:
    print(record['MedlineCitation']['Article']['ArticleTitle'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/anaconda/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 302, in parse
    raise ValueError("The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse")
ValueError: The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse

I have reproduced this error on Mac OS X and also a Linux machine.  Peter has also reproduced the problem.

Can you rewrite the above example so that it works with Entrez.read to print out the “ArticleTitle” data?  A better solution of course is to fix Entrez.parse.  I have tried myself to fix this problem, but I am stumped.
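
To make the request concrete, here is a rough sketch of the kind of rewrite I have in mind. It is untested against the live NCBI server and rests on an assumption: that with the new DTD, Entrez.read returns a dictionary-like object holding the articles under the "PubmedArticle" key, whereas the older result was a plain list. The function names article_titles and fetch_titles are made up for illustration and are not part of Biopython.

```python
def article_titles(parsed):
    """Return the ArticleTitle of each PubMed article in an Entrez.read result."""
    # Assumption: newer results are dict-like with a "PubmedArticle" list;
    # older results were a plain list of articles.
    if hasattr(parsed, "keys"):
        articles = parsed["PubmedArticle"]
    else:
        articles = parsed
    return [a["MedlineCitation"]["Article"]["ArticleTitle"] for a in articles]


def fetch_titles(id_list, email):
    """Fetch PubMed records over the network and return their titles."""
    # Imported here so that article_titles can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(db="pubmed", id=id_list, retmode="xml")
    try:
        parsed = Entrez.read(handle)
    finally:
        handle.close()
    return article_titles(parsed)
```

Calling fetch_titles("19304878,14630660", "Your.Name.Here@example.org") should then give one title per record, if the assumption about the structure holds.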

Konrad

> On 21 Dec 2016, at 08:47, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> 
> In what sense is the current result from Entrez.read more difficult to parse than the previous result from Entrez.parse?
> As far as I can tell, Entrez.read and Entrez.parse are both working correctly.
> Best,
> -Michiel
> 
> 
> On Tuesday, December 20, 2016 1:43 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
> 
> 
> Then how does one parse the output? Entrez.parse used to work, but no longer does. Apparently NCBI has made changes to their XML that have broken Entrez.parse. Entrez.read returns a complex data structure that is difficult to parse.
> If one adds "['PubmedArticle']" to line 302 of /Bio/Entrez/Parser.py so that it reads:
> records = self.stack[0]['PubmedArticle']
> this suppresses the error message, but it mysteriously returns only the strings "PubmedArticle" and "PubmedBookArticle" and not the citation. Any ideas?
> 
> Konrad
> 
>> On 20 Dec 2016, at 05:16, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> 
>> Entrez.read works for me for the example shown.
>> 
>> Best,
>> -Michiel
>> 
>> 
>> On Sunday, December 18, 2016 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> 
>> 
>> On Sun, Dec 18, 2016 at 2:50 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> > On Thu, Dec 15, 2016 at 7:37 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>> >> Hello everyone,
>> >>
>> >> I have been using Entrez.parse for years without any errors.  However just
>> >> in the last day or two, it stopped working.  I have been able to reproduce
>> >> the error using the following example from the biopython Package Entrez
>> >> documentation:
>> >>
>> >
>> > I can reproduce this. The XML looks sensible, two <PubmedArticle>
>> > tags:
>> >
>> > <?xml version="1.0" ?>
>> > <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2017//EN"
>> > "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
>> > <PubmedArticleSet>
>> > <PubmedArticle>
>> >    <MedlineCitation Status="MEDLINE" Owner="NLM">
>> >        <PMID Version="1">19304878</PMID>
>> >        ...
>> >    </MedlineCitation>
>> >    <PubmedData>
>> >        ...
>> >    </PubmedData>
>> > </PubmedArticle>
>> > <PubmedArticle>
>> >    <MedlineCitation Status="MEDLINE" Owner="NLM">
>> >        <PMID Version="1">14630660</PMID>
>> >        ...
>> >    </MedlineCitation>
>> >    <PubmedData>
>> >        ...
>> >    </PubmedData>
>> > </PubmedArticle>
>> > </PubmedArticleSet>
>> >
>> > Note however it is using a new DTD file for Jan 2017,
>> >
>> > https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd
>> >
>> >
>> >> Does anyone have any suggestions on how to get Entrez.parse working again? I
>> >> am also curious why this stopped working.  Has the NCBI server changed?
>> >>
>> >
>> > I would guess that the NCBI changed something subtly. Michiel?
>> >
>> > Peter
>> 
>> Logged on GitHub,
>> 
>> https://github.com/biopython/biopython/issues/1027
>> 
>> 
>> Peter
>> 
>> 
> 
> 
> 
