[Bioperl-l] Problem When searching PubMed

Chris Fields cjfields at illinois.edu
Thu Dec 18 17:46:24 UTC 2008


On Dec 17, 2008, at 8:44 AM, analia at deb.uminho.pt wrote:

>> Hi all,
>>
>> I am using your BioPerl eutils search modules. Probably it is my  
>> fault, but when I try to run a query for which PubMed returns a  
>> large number of documents, the get_count never meets the overall  
>> figures! When retrieving the results, the program fails eventually.
>>
>> Any idea of what I am doing wrong? I am sending two examples...
>>
>> Cheers,
>> Anália
>>
>> Here are two examples:
>>
>>
>> my $biblio = new Bio::Biblio (-access => 'eutils');
>> print $biblio->find ("Escherichia coli")->get_count . "\n";
>>
>>
>> Output : 100000 whereas PubMed says 258953!
>>
>> my %docs=();
>> my $biblio = new Bio::Biblio (-access => 'eutils');
>> my $collection = $biblio->find("Escherichia coli");
>>
>> while ( $collection->has_next) {
>>     my $reader = Bio::Biblio::IO->new ('-data'   => $collection->get_next,
>>                                        '-format' => 'medlinexml');
>>     if(my $citation = $reader->next_bibref()){
>>             $docs{$citation->{'_pmid'}}{title}=$citation->{'_title'};
>>       $docs{$citation->{'_pmid'}}{type}=$citation->{'_type'};
>>       $docs{$citation->{'_pmid'}}{date}=$citation->{'_date'}||"null";
>>   }
>> }
>> print scalar(keys %docs);
>>
>> Output: never get output, crashes the perl interpreter!!!

The 100000 is likely from retmax being set somewhere in the Bio::Biblio
code.  It appears that cap can be raised via the package variable
$Bio::DB::Biblio::eutils::MAX_RECORDS.

Most importantly, I can confirm the seg fault.  It doesn't surprise me
too much if one thinks about what you are trying to do (attempting to
retrieve ~100000 records in one go).  Bio::DB::Biblio::eutils
apparently isn't set up to lazily retrieve and parse XML (or to send
back small pieces of XML by piecemeal retrieval of IDs), so you will
likely run into problems like this with very large XML files unless
you retrieve the XML in small chunks.
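The chunking idea itself is language-neutral; as a rough sketch (in
Python here for brevity, with a hypothetical fetch_window callable
standing in for the actual EFetch request), the window arithmetic
looks like:

```python
# Sketch of windowed retrieval: fetch_window is a hypothetical stand-in
# for the real fetch call; only the retstart/retmax arithmetic is shown.
def fetch_in_chunks(total_count, chunk_size, fetch_window):
    """Call fetch_window(retstart, retmax) until total_count records are covered."""
    results = []
    retstart = 0
    while retstart < total_count:
        # never ask for more records than remain
        retmax = min(chunk_size, total_count - retstart)
        results.append(fetch_window(retstart, retmax))
        retstart += retmax
    return results

# Example with a dummy fetch that just records the windows requested:
windows = fetch_in_chunks(1234, 500, lambda start, n: (start, n))
# windows == [(0, 500), (500, 500), (1000, 234)]
```

Each window is small enough for the parser to handle, and the final
window is clipped so the server is never asked for records past the
reported count.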

Unfortunately, Bio::Biblio doesn't appear to have a workaround either,
so it's a bug (there needs to be a way to retrieve a list of IDs).
Following is a workaround where you use Bio::DB::EUtilities and pipe
the returned XML into the parser (this iterates through the IDs stored
on NCBI's server via 'usehistory').  Note that this calls the server
over and over with a pause in between; I would highly recommend NOT
retrieving 100000 records at NCBI peak-use times, as you'll probably
run into server issues (I have a retry fallback built in below just
in case).

chris

# BEGIN CODE

#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::Biblio::IO;
use Bio::DB::EUtilities;

my $eutil = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                      -db => 'pubmed',
                                      -term => 'Escherichia coli',
                                      -usehistory => 'y',
                                      #-retmax => 100000
                                      );

# uncommenting the above retmax changes the below count
my $count = $eutil->get_count;

print "Count: $count\n";
my $hist = $eutil->next_History || die 'No history data returned';

my ($start, $end);

$eutil->set_parameters(-eutil => 'efetch',
                          -history => $hist);

# change retmax to vary chunk size returned
my ($retmax, $retstart) = (500,0);
my $tries = 0;

RETRIEVE_XML:
while ($retstart < $count) {
     $eutil->set_parameters(-retmax => $retmax,
                             -retstart => $retstart);
     my $xml;
     eval {
         $xml = $eutil->get_Response()->content;
     };
     if ($@) {
         die "Server error: $@.  Try again later" if $tries == 5;
         print STDERR "Server error, retry #".($tries + 1)."\n";
         # increment before redo; '$tries++ && redo' would skip the first
         # retry because the post-increment returns 0
         $tries++;
         redo RETRIEVE_XML;
     }
     $tries = 0;    # successful fetch; reset the retry counter

     # then parse smaller XML chunk here

     # If you want to persist data and save some memory,
     # maybe tie below hash to a DB_File outside of this loop

     my %docs=();

     my $reader = Bio::Biblio::IO->new('-data'   => $xml,
                                       '-format' => 'medlinexml');
     while (my $citation = $reader->next_bibref()){

         $docs{$citation->{'_pmid'}}{title}=$citation->{'_title'};
         $docs{$citation->{'_pmid'}}{type}=$citation->{'_type'};
         $docs{$citation->{'_pmid'}}{date}=$citation->{'_date'}||"null";
     }

     my $last = $retstart + $retmax < $count ? $retstart + $retmax : $count;
     print "Retrieved ".($retstart + 1)." to $last\n";

     print scalar(keys %docs)." citations parsed in this chunk\n";

     $retstart += $retmax;

     # pause briefly between requests, per NCBI usage guidelines
     sleep 3;
}
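The fallback in the loop above is a generic capped-retry pattern:
attempt the request, retry on failure, and give up after a fixed
number of tries.  A minimal sketch of the same logic (Python, with a
hypothetical flaky fetch standing in for the HTTP call):

```python
# Capped-retry sketch: retry a failing fetch up to max_tries times,
# then give up.  'fetch' is a hypothetical stand-in for the HTTP call.
def fetch_with_retries(fetch, max_tries=5):
    last_err = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except RuntimeError as err:
            last_err = err
    raise RuntimeError(f"Server error after {max_tries} tries: {last_err}")

# Example: a fetch that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary server error")
    return "<xml/>"

result = fetch_with_retries(flaky)
```

In production you would usually also sleep between attempts (ideally
with increasing delays) so a struggling server isn't hammered.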
