[Bioperl-l] PMC and EUtilities, was bioperl::db

Mon Apr 30 15:15:16 UTC 2007

Bernd,

As a preface to this discussion: I am in the middle of refactoring  
EUtilities. The next incarnation should have a similar API but will  
likely set parameters via simpler methods (no need for all the  
getters/setters).

You'll likely have to parse out the tags yourself. AFAIK there is no  
BioPerl parser for PMC XML; a quick grep turns up nothing but PubMed  
parsers. If you aren't familiar with XML parsing, you could try  
XML::Simple to get at what you want. I would pass the XML in as small  
chunks (maybe by retrieving the articles in batches of 100 or fewer)  
and initially use Data::Dumper to inspect the data structure  
XML::Simple returns (PMC XML has both attributes and elements, so the  
structure will be more complex). Then just iterate through the  
articles and grab what you want.
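
Here is a rough sketch of that approach, not a tested recipe for real
PMC output. The XML below is a made-up, heavily trimmed stand-in for one
efetch batch; the element and attribute names (pmc-articleset, article,
article-id, pub-id-type, article-title) are my assumptions about the
layout, so check them against what Dumper shows for your own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;

# Hypothetical, trimmed stand-in for one batch of efetch output.
# Real PMC XML is far larger; verify its layout with Dumper first.
my $xml = <<'END_XML';
<pmc-articleset>
  <article article-type="research-article">
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">111111</article-id>
        <title-group>
          <article-title>First example title</article-title>
        </title-group>
      </article-meta>
    </front>
  </article>
  <article article-type="review-article">
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">222222</article-id>
        <title-group>
          <article-title>Second example title</article-title>
        </title-group>
      </article-meta>
    </front>
  </article>
</pmc-articleset>
END_XML

# ForceArray keeps 'article' and 'article-id' as array refs even when
# only one is present; KeyAttr => [] turns off XML::Simple's folding of
# lists into hashes keyed on name/key/id attributes.
my $data = XMLin(
    $xml,
    ForceArray => [ 'article', 'article-id' ],
    KeyAttr    => [],
);

# Step 1: look at the structure before writing any extraction code.
$Data::Dumper::Indent = 1;
print Dumper($data);

# Step 2: iterate through the articles and grab what you want.
my @summaries;
for my $article ( @{ $data->{article} } ) {
    my $meta = $article->{front}{'article-meta'};

    # An element with both an attribute and text becomes a hash whose
    # text is stored under the 'content' key.
    my ($pmcid) = map  { $_->{content} }
                  grep { ( $_->{'pub-id-type'} // '' ) eq 'pmc' }
                  @{ $meta->{'article-id'} };

    # In real data a title containing inline markup (e.g. italics)
    # would come back as a hash rather than a plain string.
    my $title = $meta->{'title-group'}{'article-title'};

    push @summaries, {
        pmcid => $pmcid,
        type  => $article->{'article-type'},
        title => $title,
    };
}

printf "%s\t%s\t%s\n", @{$_}{qw(pmcid type title)} for @summaries;
```

Without the ForceArray setting, a batch that happens to contain a single
article would come back as a hash instead of an array ref, and the loop
would break.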

I think most (if not all) of the articles in PubMed Central are  
free full text:

http://www.pubmedcentral.nih.gov/about/faq.html#q9

You can retrieve them via ftp:

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

which contains an index file listing all articles and their directory  
locations (the readme gives more info).

chris

On Apr 30, 2007, at 4:07 AM, Bernd Mueller wrote:

> Hello,
>
> I think so. The ids from my wanted articles are retrieved by  
> Bio::DB::EUtilities::esearch. Afterwards I download the articles  
> with Bio::DB::EUtilities::efetch. It is only possible to download  
> in XML format from PMC. So post processing is actually needed  
> because I want the articles in plain format.
>
> But I don't know why I have results of non-free articles, i.e.  
> abstracts where full articles should be found with a query  
> constraining to only free fulltext. In the query I limit the search  
> with the filter "AND free fulltext[filter]". Probably this is a  
> matter concerning not directly bioperl but the eutilities interface  
> of PMC.
>
> Regards,
> Bernd