[Bioperl-l] Bio::DB::GenBank and large number of requests

Chris Fields cjfields at uiuc.edu
Wed Jan 30 20:42:14 UTC 2008


When using Bio::DB::EUtilities (from bioperl-live) this works for me:

use Bio::DB::EUtilities;

# get array of IDs somehow, in @ids

my ($start, $chunk, $last) = (0, 100, $#ids);

my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
                      -db => 'protein',
                      -rettype => 'genbank');

my $ct = 1; # used to denote separate files
my $tries = 0; # server attempts

while ($start <= $last) {  # <= so the final ID is not dropped
     # want seqs in chunk size of 100 (set above)
     my $end = ($start + $chunk - 1) < $last ? ($start + $chunk - 1) : $last;
     # grab slice of IDs
     my @sub = @ids[$start..$end];

     # pass to agent
     $factory->set_parameters(-id => \@sub );

     eval {
         # check server response, if good send to file
         $factory->get_Response(-file => ">seqs_$ct.gb");
     };

     # ERROR!
     if ($@) {
         $tries++;
         if ($tries <= 10) {
             warn("Server problem on attempt $tries: $@.\nTrying again...");
             redo;
         } else {
             die("Repeated server issues after $tries attempts.");
             # could warn and just skip this batch of accs using 'next'
         }
     }

     $start = $end+1;
     $ct++;
     $tries = 0;
}
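The chunk arithmetic can also be factored into a small helper; a minimal sketch (`chunk_ids` is just an illustrative name, using the same batch size of 100 as the loop above):

```perl
use strict;
use warnings;

# Split a list of IDs into batches of at most $size elements.
# Equivalent to the $start/$end slicing in the loop above.
sub chunk_ids {
    my ($size, @ids) = @_;
    my @batches;
    push @batches, [ splice @ids, 0, $size ] while @ids;
    return @batches;
}

my @batches = chunk_ids(100, 1 .. 250);
print scalar(@batches), " batches\n";              # 3 batches
print scalar(@{ $batches[-1] }), " in the last\n"; # 50 in the last
```

Each batch is an array reference, so it can be handed straight to set_parameters(-id => $batch) as in the loop above.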



chris

On Jan 30, 2008, at 8:56 AM, Tristan Lefebure wrote:

> Thank you both!
>
> Just in case it might be useful for someone else, here are my
> ramblings:
>
> 1. I first tried to adapt my script and fetch 500 sequences at a
> time. It works, except that ~40% of the time NCBI gives the
> following error and my script crashes:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> [...]
>    The proxy server received an invalid
>    response from an upstream server.
> [...]
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/ 
> Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/ 
> DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/ 
> 5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/ 
> 5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:68
> -----------------------------------------------------------
>
> I tried to modify the script so that when the retrieval of a
> 500-sequence block crashes, it continues with the other blocks, but I
> was unsuccessful. It probably requires a better understanding of
> BioPerl errors...
> Here is the section of the script that was modified:
> #########
> my $n_seq = scalar @list;
> my @aborted;
>
> for (my $i=1; $i<=$n_seq; $i += 500) {
> 	print "Fetching sequences $i to ", $i+499, ": ";
> 	my $start = $i - 1;
> 	# at most 500 IDs per batch, clamped so we never run past the list
> 	my $end = ($i + 498 < $n_seq - 1) ? $i + 498 : $n_seq - 1;
> 	my @red_list = @list[$start .. $end];
> 	my $gb = new Bio::DB::GenBank(	-retrievaltype => 'tempfile',
> 					-format => $dformat,
> 					-db => $db,
> 				);
>
> 	# get_Stream_by_acc throws on failure (see the exception above),
> 	# so a plain unless() never catches it; trap it with eval instead
> 	my $seqio;
> 	eval { $seqio = $gb->get_Stream_by_acc(\@red_list) };
> 	if ($@ or not $seqio) {
> 		print "Aborted, resubmit later\n";
> 		push @aborted, @red_list;
> 		next;
> 	}
> 	
> 	my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1].$i",
> 					-format => $format,
> 				);
> 	while (my $seqo = $seqio->next_seq ) {
> # 		print $seqo->id, "\n";
> 		$seqout->write_seq($seqo);
> 	}
> 	print "Done\n";
> }
>
> if (@aborted) {
> 	open OUT, ">aborted_fetching.AN" or die "Can't write aborted list: $!";
> 	print OUT $_ foreach @aborted;
> 	close OUT;
> }
> ##########
>
>
> 2. So I moved to the second solution and tried Batch Entrez. I cut my
> 120,000-long AN list into 10,000-line pieces using split:
> split -l 10000 full_list.AN splitted_list_
>
> and then submitted the 13 lists one by one. I must say that I don't
> really like using a web interface to fetch data, and here the most
> annoying part is that you end up with a regular Entrez/GenBank
> web page: select your format, export to file, choose a file name...
> and do it all again many times.
> It is too prone to human and web-browser errors for my taste,
> but it worked.
> Nevertheless there are some caveats:
> - some downloaded files were incomplete (~10%) and you have to
> restart them
> - you can't submit several lists at the same time (otherwise the
> same cookie will be used and you'll end up with several identical
> files)
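For the incomplete downloads, a quick sanity check is to count records in each file before trusting it; a minimal sketch, assuming GenBank flat-file output (each record starts with a LOCUS line and ends with a lone `//`; `count_records` is an illustrative name):

```perl
use strict;
use warnings;

# Count records in a GenBank flat file: every record begins with a LOCUS
# line and ends with a '//' terminator, so the two counts should match
# and equal the number of IDs submitted for that batch.
sub count_records {
    my ($fh) = @_;
    my ($locus, $term) = (0, 0);
    while (my $line = <$fh>) {
        $locus++ if $line =~ /^LOCUS\s/;
        $term++  if $line =~ m{^//\s*$};
    }
    return ($locus, $term);
}

# Example on an in-memory two-record stub:
my $stub = "LOCUS       AB000001\n//\nLOCUS       AB000002\n//\n";
open my $fh, '<', \$stub or die $!;
my ($locus, $term) = count_records($fh);
print "$locus records, $term terminators\n";   # both should be 2
```

Comparing the LOCUS count against the length of the submitted ID list flags truncated files without opening them by hand.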
>
> -Tristan
>
> On Tuesday 29 January 2008 13:44:16 you wrote:
>> Forgot about that one; it's definitely a better way to do it if you
>> have the GI/accessions.
>>
>> chris
>>
>> On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:
>>> You don't need BioPerl to accomplish this task of downloading
>>> several thousand sequences based on an accession ID list.
>>>
>>> NCBI batch Entrez can do that:
>>> http://www.ncbi.nlm.nih.gov/sites/batchentrez
>>>
>>> just submit a large list of IDs, select database, and download.
>>>
>>> You can usually submit ~50,000 IDs in one file without problems;
>>> it may not return results if a list is larger than ~100,000 IDs.
>>>
>>> --
>>> Alexander Kozik
>>> Bioinformatics Specialist
>>> Genome and Biomedical Sciences Facility
>>> 451 Health Sciences Drive
>>> Genome Center, 4-th floor, room 4302
>>> University of California
>>> Davis, CA 95616-8816
>>> Phone: (530) 754-9127
>>> email#1: akozik at atgc.org
>>> email#2: akozik at gmail.com
>>> web: http://www.atgc.org/
>>>
>>> Chris Fields wrote:
>>>> Yes, you can only retrieve ~500 sequences at a time using
>>>> Bio::DB::GenBank.  Both Bio::DB::GenBank and Bio::DB::EUtilities
>>>> interact with NCBI's EUtilities (the former module returns
>>>> Bio::Seq/Bio::SeqIO objects, the latter returns raw data from the
>>>> URL to be processed later).
>>>> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
>>>> You can usually post more IDs using epost and fetch sequences by
>>>> referring to the WebEnv/key combo (batch posting).  I try to make
>>>> this a bit easier with EUtilities, but it is woefully lacking in
>>>> documentation (my fault); there is some code up on the wiki
>>>> which should work.
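The epost/WebEnv approach mentioned above posts the whole ID list once and then fetches slices of the server-side copy; a rough sketch of the request plumbing (the URLs follow NCBI's EUtilities conventions, the XML here is a canned stand-in, and no request is actually sent):

```perl
use strict;
use warnings;

my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Step 1: POST the full UID list once to epost.fcgi (these are the form
# fields an agent such as LWP::UserAgent would send).
my %epost_form = (db => 'protein', id => join(',', 1 .. 1000));

# NCBI answers with a WebEnv/QueryKey pair naming the stored list.
sub parse_epost {
    my ($xml) = @_;
    my ($webenv) = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    my ($qkey)   = $xml =~ m{<QueryKey>(\d+)</QueryKey>};
    return ($webenv, $qkey);
}

# Canned response, for illustration only:
my $xml = '<ePostResult><QueryKey>1</QueryKey>'
        . '<WebEnv>NCID_01_example</WebEnv></ePostResult>';
my ($webenv, $qkey) = parse_epost($xml);

# Step 2: fetch slices of the stored list with retstart/retmax, so no
# single request carries thousands of IDs.
my $url = "$base/efetch.fcgi?db=protein&query_key=$qkey"
        . "&WebEnv=$webenv&rettype=gb&retstart=0&retmax=500";
print "$url\n";
```

Looping retstart over 0, 500, 1000, ... then walks the whole posted list with the same WebEnv/query_key pair.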
>>>> chris
>>>>
>>>> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
>>>>> Hello,
>>>>>
>>>>> I would like to download a large number of sequences from GenBank
>>>>> (122,146 to be exact) following a list of accession numbers.
>>>>> I first investigated Bio::DB::EUtilities, but got lost and
>>>>> finally used Bio::DB::GenBank.
>>>>> My script works well for short requests, but it gives the following
>>>>> error with long requests:
>>>>>
>>>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>>>> MSG: WebDBSeqI Request Error:
>>>>> 500 short write
>>>>> Content-Type: text/plain
>>>>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>>>>> Client-Warning: Internal response
>>>>>
>>>>> 500 short write
>>>>>
>>>>> STACK: Error::throw
>>>>> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/ 
>>>>> Root/
>>>>> Root.pm:359
>>>>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/
>>>>> Bio/DB/WebDBSeqI.pm:685
>>>>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/
>>>>> 5.8.8/Bio/DB/WebDBSeqI.pm:472
>>>>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/
>>>>> perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>>>>> STACK: ./fetch_from_genbank.pl:58
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Does that mean that we can only fetch 500 sequences at a time?
>>>>> Should I split my list into 500-id fragments and submit them one
>>>>> after the other?
>>>>>
>>>>> Any suggestions very welcome...
>>>>> Thanks,
>>>>> -Tristan
>>>>>
>>>>>
>>>>> Here is the script:
>>>>>
>>>>> ##################################
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::DB::GenBank;
>>>>> # use Bio::DB::EUtilities;
>>>>> use Bio::SeqIO;
>>>>> use Getopt::Long;
>>>>>
>>>>> # 2008-01-22 T Lefebure
>>>>> # I tried to use Bio::DB::EUtilities without much success and
>>>>> # went back to Bio::DB::GenBank.
>>>>> # The following procedure is not really good, as the stream is
>>>>> # first copied to a temporary file and then re-used by BioPerl
>>>>> # to generate the final file.
>>>>>
>>>>> my $db = 'nucleotide';
>>>>> my $format = 'genbank';
>>>>> my $help= '';
>>>>> my $dformat = 'gb';
>>>>>
>>>>> GetOptions(
>>>>>   'help|?' => \$help,
>>>>>   'format=s'  => \$format,
>>>>>   'database=s'    => \$db,
>>>>> );
>>>>>
>>>>>
>>>>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc>  
>>>>> <output>
>>>>>
>>>>> Will download the corresponding data from GenBank. BioPerl is
>>>>> required.
>>>>>
>>>>> Options:
>>>>>   -h
>>>>>       print this help
>>>>>   -format: genbank|fasta|...
>>>>>       give output format (default=genbank)
>>>>>   -database: nucleotide|genome|protein|...
>>>>>       define the database to search in (default=nucleotide)
>>>>>
>>>>> The full description of the options can be found at
>>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
>>>>> \n";
>>>>>
>>>>> if ($#ARGV<1) {
>>>>>   print $printhelp;
>>>>>   exit;
>>>>> }
>>>>>
>>>>> open LIST, $ARGV[0] or die "Can't open $ARGV[0]: $!";
>>>>> my @list = <LIST>;
>>>>> chomp @list;
>>>>>
>>>>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>>>>
>>>>> my $gb = new Bio::DB::GenBank(    -retrievaltype => 'tempfile',
>>>>>               -format => $dformat,
>>>>>               -db => $db,
>>>>>           );
>>>>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>>>>
>>>>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>>>>>               -format => $format,
>>>>>           );
>>>>> while (my $seqo = $seqio->next_seq ) {
>>>>>   print $seqo->id, "\n";
>>>>>   $seqout->write_seq($seqo);
>>>>> }
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> Christopher Fields
>>>> Postdoctoral Researcher
>>>> Lab of Dr. Robert Switzer
>>>> Dept of Biochemistry
>>>> University of Illinois Urbana-Champaign
>>
>
>






