[Biopython] PubmedCentral XML parsing
Paulo Nuin
nuin at genedrift.org
Thu Apr 25 19:16:49 UTC 2013
Hi Peter
Thanks a lot. I am getting an error when trying to parse the file with Entrez.parse. I download the .nxml file before parsing, using PMC's FTP server to stay within their bulk-download restrictions. The code I am using is quite simple (in IPython):
In [1]: from Bio import Entrez
In [2]: handle = open('nihms83342.nxml')
In [3]: records = Entrez.parse(handle)
In [4]: for i in records:
   ...:     print i
   ...:
---------------------------------------------------------------------------
NotXMLError                               Traceback (most recent call last)
<ipython-input-4-82461854c9e7> in <module>()
----> 1 for i in records:
      2     print i
      3
/Library/Python/2.7/site-packages/Bio/Entrez/Parser.pyc in parse(self, handle)
    229             # We did not see the initial <!xml declaration, so
    230             # probably the input data is not in XML format.
--> 231             raise NotXMLError("XML declaration not found")
    232         self.parser.Parse("", True)
    233         self.parser = None
NotXMLError: Failed to parse the XML data (XML declaration not found). Please make sure that the input data are in XML format.
And the file header is:
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="EN">
<?properties open_access?>
<?properties manuscript?>
<front>
<journal-meta>
Is there a different way of parsing this file?
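For reference, one route that does not depend on Bio.Entrez is the standard library's xml.etree.ElementTree. The sketch below assumes the broken paragraphs come from inline elements (<italic>, <xref>, ...) nested inside <p>, and that the article uses no default namespace, as in the header shown above. The helper name iter_paragraphs and the SAMPLE document are made up for illustration and are not part of Biopython or of the real file; a file that relies on named entities defined only in the DTD may still fail to parse this way.

```python
import io
import xml.etree.ElementTree as ET


def iter_paragraphs(source):
    """Yield the plain text of each <p> element found under <body>.

    `source` is a filename or an open (binary) file object.
    Hypothetical helper, not a Biopython function.
    """
    tree = ET.parse(source)
    body = tree.getroot().find("body")
    if body is None:
        return
    for p in body.iter("p"):
        # itertext() walks the text inside nested inline tags, so the
        # paragraph is not cut at each child element.
        text = "".join(p.itertext())
        # Collapse line breaks and runs of whitespace into single spaces.
        yield " ".join(text.split())


# Small made-up document that mimics the header of the real file.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" xml:lang="EN">
<?properties open_access?>
<body>
<p>Expression of <italic>lacZ</italic> was
reduced in the mutant (<xref ref-type="bibr" rid="R1">1</xref>).</p>
<p>A second paragraph.</p>
</body>
</article>"""

paragraphs = list(iter_paragraphs(io.BytesIO(SAMPLE)))
for paragraph in paragraphs:
    print(paragraph)
```

For the real file, the same function can be called as iter_paragraphs('nihms83342.nxml'). Note that body.iter("p") also returns <p> elements nested in captions or lists, which may or may not be wanted.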
Thanks in advance
Paulo
On 2013-04-25, at 3:05 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Apr 25, 2013 at 7:42 PM, Paulo Nuin <nuin at genedrift.org> wrote:
>> Hi
>>
>> What would be the most direct way of parsing XML files downloaded from
>> PubmedCentral ftp using BioPython? These are files that use the
>> archivearticle.dtd and when parsed using non-DTD based code generate broken
>> paragraphs on the body of the document due to < > between <p> items of the
>> body.
>>
>> Thanks in advance
>>
>> Paulo
>
> The Bio.Entrez parser is DTD based, and might suit your needs.
>
> Peter