[Biopython] PubmedCentral XML parsing

Thu Apr 25 19:16:49 UTC 2013

Hi Peter

Thanks a lot. I get an error when I try to parse the file with Entrez.parse. I download the .nxml file before parsing, from PMC's FTP server, to avoid their bulk download restrictions. The code I am using is quite simple (in IPython):

In [1]: from Bio import Entrez

In [2]: handle = open('nihms83342.nxml')

In [3]: records = Entrez.parse(handle)

In [4]: for i in records:
   ...:     print i
   ...:
---------------------------------------------------------------------------
NotXMLError                               Traceback (most recent call last)
<ipython-input-4-82461854c9e7> in <module>()
----> 1 for i in records:
      2     print i
      3

/Library/Python/2.7/site-packages/Bio/Entrez/Parser.pyc in parse(self, handle)
    229                         # We did not see the initial <!xml declaration, so
    230                         # probably the input data is not in XML format.
--> 231                         raise NotXMLError("XML declaration not found")
    232                 self.parser.Parse("", True)
    233                 self.parser = None

NotXMLError: Failed to parse the XML data (XML declaration not found). Please make sure that the input data are in XML format.

And the file header is:

<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="EN">
	<?properties open_access?>
	<?properties manuscript?>
	<front>
		<journal-meta>

Is there a different way of parsing this file?
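
One possible fallback, sketched here with the standard library's xml.etree.ElementTree rather than Bio.Entrez, is to walk the <p> elements and join all of their nested text, so that inline tags inside a paragraph do not break it apart. This is only a sketch: the sample document and the helper name paragraph_texts are made up for illustration, and it assumes the file does not depend on named entities that are defined only in archivearticle.dtd, since ElementTree does not load the external DTD.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a PMC .nxml file. For a real file,
# pass open('nihms83342.nxml', 'rb') to paragraph_texts() instead.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" xml:lang="EN">
  <front><journal-meta><journal-id>Example</journal-id></journal-meta></front>
  <body>
    <sec>
      <p>Expression of <italic>GENE1</italic> increased
      (<xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
      <p>A second paragraph.</p>
    </sec>
  </body>
</article>
"""


def paragraph_texts(handle):
    """Return the text of every <p> element, one string per paragraph."""
    tree = ET.parse(handle)
    texts = []
    for p in tree.getroot().iter("p"):
        # itertext() yields the paragraph's own text plus the text and tails
        # of nested inline elements (<italic>, <xref>, ...), in document order.
        raw = "".join(p.itertext())
        # Collapse line breaks and indentation into single spaces.
        texts.append(" ".join(raw.split()))
    return texts


paragraphs = paragraph_texts(io.BytesIO(SAMPLE))
for text in paragraphs:
    print(text)
```

Note that iter("p") also visits <p> elements nested inside other structures such as captions, so a real script may need to restrict the search to the <body> element.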

Thanks in advance

Paulo

On 2013-04-25, at 3:05 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Apr 25, 2013 at 7:42 PM, Paulo Nuin <nuin at genedrift.org> wrote:
>> Hi
>> 
>> What would be the most direct way of parsing XML files downloaded from
>> PubmedCentral ftp using BioPython?  These are files that use the
>> archivearticle.dtd and when parsed using non-DTD based code generate broken
>> paragraphs on the body of the document due to < > between <p> items of the
>> body.
>> 
>> Thanks in advance
>> 
>> Paulo
> 
> The Bio.Entrez parser is DTD based, and might suit your needs.
> 
> Peter