[Biopython] Entrez.parse error
Konrad Koehler
konrad.koehler at mac.com
Wed Dec 21 08:27:31 UTC 2016
Entrez.parse was written for a reason: to parse complex XML data so that it is easy to extract citation data from it. Entrez.read does indeed work, but its output is such a complex data structure that parsing it is a non-trivial exercise.
Entrez.parse was working for a very long time, but is no longer working. Try the following example from Biopython documentation <http://biopython.org/DIST/docs/api/Bio.Entrez-module.html>:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
records = Entrez.parse(handle)
for record in records:
    print(record['MedlineCitation']['Article']['ArticleTitle'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/anaconda/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 302, in parse
    raise ValueError("The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse")
ValueError: The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse
I have reproduced this error on Mac OS X and also a Linux machine. Peter has also reproduced the problem.
Can you rewrite the above example so that it works with Entrez.read to print out the “ArticleTitle” data? A better solution, of course, is to fix Entrez.parse. I have tried to fix this problem myself, but I am stumped.
Konrad
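
For reference, a minimal sketch of what an Entrez.read-based version of the example might look like. It assumes that the Entrez.read result behaves like a dictionary with "PubmedArticle" and "PubmedBookArticle" keys (the two strings mentioned further down the thread) and that each entry under "PubmedArticle" has the same layout as the records Entrez.parse used to return. The helper name article_titles and the placeholder titles are hypothetical, and a plain dict stands in for the real result so that the sketch runs without network access.

```python
def article_titles(result):
    """Yield the ArticleTitle of each citation in an Entrez.read-style result."""
    # Assumption: 'PubmedArticle' maps to a list of per-citation records.
    for article in result.get("PubmedArticle", []):
        yield article["MedlineCitation"]["Article"]["ArticleTitle"]


# Stand-in for the structure that would come from (not executed here):
#   handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
#   result = Entrez.read(handle)
# The titles below are placeholders, not the real article titles.
sample = {
    "PubmedArticle": [
        {"MedlineCitation": {"Article": {"ArticleTitle": "First placeholder title"}}},
        {"MedlineCitation": {"Article": {"ArticleTitle": "Second placeholder title"}}},
    ],
    "PubmedBookArticle": [],
}

# Iterating over the dictionary itself yields only its keys,
# not the citations stored under them.
keys = list(sample)

titles = list(article_titles(sample))
for title in titles:
    print(title)
```

If the real result has this shape, looping over result['PubmedArticle'] instead of over the result itself should give one record per citation.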
> On 21 Dec 2016, at 08:47, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> In what sense is the current result from Entrez.read more difficult to parse than the previous result from Entrez.parse?
> As far as I can tell, Entrez.read and Entrez.parse are both working correctly.
> Best,
> -Michiel
>
>
> On Tuesday, December 20, 2016 1:43 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>
>
> Then how does one parse the output? Entrez.parse used to work, but no longer does. Apparently NCBI has made changes to their XML that have broken Entrez.parse. Entrez.read returns a complex data structure that is difficult to parse.
> If one adds "['PubmedArticle']" to line 302 of /Bio/Entrez/Parser.py so that it reads:
> records = self.stack[0]['PubmedArticle']
> this suppresses the error message, but it mysteriously returns only the strings "PubmedArticle" and "PubmedBookArticle" and not the citation. Any ideas?
>
> Konrad
>
>> On 20 Dec 2016, at 05:16, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>> Entrez.read works for me for the example shown.
>>
>> Best,
>> -Michiel
>>
>>
>> On Sunday, December 18, 2016 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>>
>> On Sun, Dec 18, 2016 at 2:50 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> > On Thu, Dec 15, 2016 at 7:37 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>> >> Hello everyone,
>> >>
>> >> I have been using Entrez.parse for years without any errors. However just
>> >> in the last day or two, it stopped working. I have been able to reproduce
>> >> the error using the following example from the biopython Package Entrez
>> >> documentation:
>> >>
>> >
>> > I can reproduce this. The XML looks sensible, two <PubmedArticle>
>> > tags:
>> >
>> > <?xml version="1.0" ?>
>> > <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st
>> > January 2017//EN"
>> > "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
>> > <PubmedArticleSet>
>> > <PubmedArticle>
>> > <MedlineCitation Status="MEDLINE" Owner="NLM">
>> > <PMID Version="1">19304878</PMID>
>> > ...
>> > </MedlineCitation>
>> > <PubmedData>
>> > ...
>> > </PubmedData>
>> > </PubmedArticle>
>> > <PubmedArticle>
>> > <MedlineCitation Status="MEDLINE" Owner="NLM">
>> > <PMID Version="1">14630660</PMID>
>> > ...
>> > </MedlineCitation>
>> > <PubmedData>
>> > ...
>> > </PubmedData>
>> > </PubmedArticle>
>> > </PubmedArticleSet>
>> >
>> > Note however it is using a new DTD file for Jan 2017,
>> >
>> > https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd
>> >
>> >
>> >> Does anyone have any suggestions on how to get Entrez.parse working again? I
>> >> am also curious why this stopped working. Has the NCBI server changed?
>> >>
>> >
>> > I would guess that the NCBI changed something subtly. Michiel?
>> >
>> > Peter
>>
>> Logged on GitHub,
>>
>> https://github.com/biopython/biopython/issues/1027
>>
>>
>> Peter
>>
>>
>
>
>