[Biopython] Pubmeddata XML parsing with Entrez .fetch and .read

Thu Jul 15 02:52:44 UTC 2010

Sure, I am new, so there are probably errors, but how about something like a
demonstration appended to the end of the tutorial section at
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105

At its core, a simple demonstration that type(record) returns a Biopython
class rather than a plain list, and that foo.attributes and foo.tag exist,
would be helpful.  I am not using any of the sequence utilities, so I admit
that my reading of those sections was brief.  Reiterating this in the Entrez
parsing sections would probably help people like me.

A more verbose demonstration follows.

Again, thanks for the help.
Guy

8.11.1  Parsing Medline records [intervening text omitted]

At this point let’s address what these elements contain.  Consider the
information returned by the following statement.

>>> records[0]['PubmedData']

{u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878',
'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month':
'3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day':
'24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3',
u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month':
'7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]}

It is important to recall that each item is an instance of a Biopython class,
rather than simply a dictionary or list.  This can be verified by

>>> type(records[0]['PubmedData']['ArticleIdList'])

Which returns <class 'Bio.Entrez.Parser.ListElement'> rather than <type
'list'>
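
A minimal sketch of what that difference means in practice, written for
Python 3 and using a hypothetical stand-in class (TaggedList is my own name,
not part of Biopython).  As far as I understand, ListElement is likewise a
subclass of the built-in list, so it still indexes, iterates, and compares
like a list while carrying extra fields:

```python
class TaggedList(list):
    """Hypothetical stand-in for a parser list element; not Biopython code."""

    def __init__(self, items=(), tag=None, attributes=None):
        super().__init__(items)
        self.tag = tag                      # name of the originating XML tag
        self.attributes = attributes or {}  # XML attributes, if any


ids = TaggedList(["btp163", "19304878"], tag="ArticleIdList")

type_name = type(ids).__name__     # reports the subclass, not 'list'
is_a_list = isinstance(ids, list)  # still usable wherever a list is expected
same_items = ids == ["btp163", "19304878"]

print(type_name, is_a_list, same_items)
```

So code that only checks isinstance(x, list) keeps working, while
type(x) is list would be False.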

This is important, as the class instance carries additional auxiliary
information, as noted earlier.  One such piece of auxiliary information is
the set of XML tag attributes from the parsed XML.

In this case, the original XML contained the following tags:

<ArticleIdList>
            <ArticleId IdType="pii">btp163</ArticleId>
            <ArticleId IdType="doi">10.1093/bioinformatics/btp163</ArticleId>
            <ArticleId IdType="pubmed">19304878</ArticleId>
            <ArticleId IdType="pmc">PMC2682512</ArticleId>
</ArticleIdList>

which have now been parsed into the u'ArticleIdList' dictionary key:

>>> L = records[0]['PubmedData']['ArticleIdList']
>>> L
['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']

Viewed as a simple list, these elements appear to lack the IdType
information.  However, the IdType attribute from the <ArticleId> tag is
stored in the parsed data, and can be retrieved by accessing “attributes” on
each Biopython element.

>>> for item in L:
...     print "%s - %s" % (item, item.attributes)
...
btp163 - {u'IdType': u'pii'}
10.1093/bioinformatics/btp163 - {u'IdType': u'doi'}
19304878 - {u'IdType': u'pubmed'}
PMC2682512 - {u'IdType': u'pmc'}
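
Since each ID carries its IdType, a lookup table keyed by type seems like a
natural next step.  The sketch below (Python 3) is self-contained so that it
runs without network access: it re-parses the XML fragment shown earlier with
the standard library's xml.etree.ElementTree and wraps each ID in a
hypothetical TaggedStr class.  TaggedStr is my own stand-in for the string
elements Bio.Entrez returns, not Biopython's class.  With a real record, I
would expect the same dictionary comprehension to work directly on
records[0]['PubmedData']['ArticleIdList'].

```python
import xml.etree.ElementTree as ET

XML = """
<ArticleIdList>
  <ArticleId IdType="pii">btp163</ArticleId>
  <ArticleId IdType="doi">10.1093/bioinformatics/btp163</ArticleId>
  <ArticleId IdType="pubmed">19304878</ArticleId>
  <ArticleId IdType="pmc">PMC2682512</ArticleId>
</ArticleIdList>
"""


class TaggedStr(str):
    """Hypothetical stand-in: a str that also carries .tag and .attributes."""

    def __new__(cls, value, tag, attributes):
        obj = super().__new__(cls, value)
        obj.tag = tag
        obj.attributes = dict(attributes)
        return obj


root = ET.fromstring(XML)
# Mimic the parsed list: plain-looking strings with attributes attached.
L = [TaggedStr(el.text, el.tag, el.attrib) for el in root.findall("ArticleId")]

# Build a lookup from IdType to the ID itself.
ids_by_type = {item.attributes["IdType"]: str(item) for item in L}

print(ids_by_type["doi"])
print(ids_by_type["pubmed"])
```

Keying on IdType avoids relying on the order of the IDs in the list, which I
would not assume is the same for every record.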

On Wed, Jul 14, 2010 at 5:33 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Wed, Jul 14, 2010 at 10:21 PM, Guy Eakin <guyeakin at gmail.com> wrote:
> > thanks.  I understood that it had a  .tag feature,but missed the
> > .attributes!
> > Awesome. And thank you for the quick reply.
> >
> > Guy
>
> No problem. Now you know the answer, can you suggest
> any clarifications to the documentation?
>
> Peter
>
> P.S. Try and CC the mailing list in replies.
>