[Bioperl-l] Bio::DB::GenBank and large number of requests
Chris Fields
cjfields at uiuc.edu
Wed Jan 30 20:42:14 UTC 2008
When using Bio::DB::EUtilities (from bioperl-live) this works for me:
use Bio::DB::EUtilities;

# get array of IDs somehow, in @ids
my ($start, $chunk, $last) = (0, 100, $#ids);

my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -db      => 'protein',
                                       -rettype => 'genbank');

my $ct    = 1; # used to denote separate files
my $tries = 0; # server attempts

while ($start <= $last) {
    # want seqs in chunk size of 100 (set above)
    my $end = ($start + $chunk - 1) < $last ? ($start + $chunk - 1) : $last;

    # grab slice of IDs
    my @sub = @ids[$start..$end];

    # pass to agent
    $factory->set_parameters(-id => \@sub);

    eval {
        # check server response; if good, send to file
        $factory->get_Response(-file => ">seqs_$ct.gb");
    };

    # ERROR!
    if ($@) {
        $tries++;
        if ($tries <= 10) {
            warn("Server problem on attempt $tries: $@.\nTrying again...");
            redo;
        } else {
            die("Repeated server issues after $tries attempts.");
            # could warn and just skip this batch of accs using 'next'
        }
    }

    $start = $end + 1;
    $ct++;
    $tries = 0;
}
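
If you'd rather not put a long ID list on every efetch URL, the epost/WebEnv route mentioned further down the thread is another option: post the IDs once, then page through the server-side history with retstart/retmax. Here is a rough, untested sketch along those lines; it assumes the history handling in bioperl-live's EUtilities (next_History and the -history/-retstart/-retmax parameters), which may differ between versions:

use Bio::DB::EUtilities;

# @ids as above; post them all once, then page through the posted set
my $factory = Bio::DB::EUtilities->new(-eutil => 'epost',
                                       -db    => 'protein',
                                       -id    => \@ids);

# grab the WebEnv/query_key handle returned by epost
my $hist = $factory->next_History || die "No history data returned";

# reuse the same factory for efetch, pointing it at the posted history
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'genbank',
                         -history => $hist);

my ($retstart, $retmax) = (0, 100);
open my $out, '>', 'seqs_epost.gb' or die "Can't open file: $!";

while ($retstart <= $#ids) {
    # fetch the next slice of the posted set
    $factory->set_parameters(-retstart => $retstart, -retmax => $retmax);
    eval {
        $factory->get_Response(-cb => sub { my ($data) = @_; print $out $data });
    };
    die "Server error at retstart=$retstart: $@" if $@; # or retry/skip as above
    $retstart += $retmax;
}

close $out;

The advantage is that the full ID list only goes to NCBI once; each later request just carries the history handle plus an offset.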
chris
On Jan 30, 2008, at 8:56 AM, Tristan Lefebure wrote:
> Thank you both!
>
> Just in case it might be useful for someone else, here are my
> ramblings:
>
> 1. I first tried to adapt my script to fetch 500 sequences at a
> time. It works, except that ~40% of the time NCBI gives the
> following error and my script crashes:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> [...]
> The proxy server received an invalid
> response from an upstream server.
> [...]
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:68
> -----------------------------------------------------------
>
> I tried to modify the script so that when the retrieval of a
> 500-sequence block crashes, it continues with the other blocks, but
> I was unsuccessful. It probably needs a better understanding of
> BioPerl errors...
> Here is the section of the script that was modified:
> #########
> my $n_seq = scalar @list;
> my @aborted;
>
> for (my $i = 1; $i <= $n_seq; $i += 500) {
>     print "Fetching sequences $i to ", $i+499, ": ";
>     my $start = $i - 1;
>     # last index of this 500-sequence slice, clamped to the end of the list
>     my $end = $start + 499;
>     $end = $#list if $end > $#list;
>     my @red_list = @list[$start .. $end];
>     my $gb = new Bio::DB::GenBank( -retrievaltype => 'tempfile',
>                                    -format        => $dformat,
>                                    -db            => $db,
>                                  );
>
>     my $seqio;
>     unless( $seqio = $gb->get_Stream_by_acc(\@red_list)) {
>         print "Aborted, resubmit later\n";
>         push @aborted, @red_list;
>         next;
>     }
>
>     my $seqout = Bio::SeqIO->new( -file   => ">$ARGV[1].$i",
>                                   -format => $format,
>                                 );
>     while (my $seqo = $seqio->next_seq ) {
>         # print $seqo->id, "\n";
>         $seqout->write_seq($seqo);
>     }
>     print "Done\n";
> }
>
> if (@aborted) {
>     open OUT, ">aborted_fetching.AN";
>     foreach (@aborted) { print OUT $_ }
> }
> ##########
>
>
> 2. So I moved to the second solution and tried Batch Entrez. I cut my
> ~120,000-long AN list into 10,000-long pieces using split:
> split -l 10000 full_list.AN splitted_list_
>
> and then submitted the 13 lists one by one. I must say that I don't
> really like using a web interface to fetch data, and here the most
> annoying part is that you end up with a regular Entrez/GenBank
> webpage: select your format, export to file, choose a file name...
> and you have to do it many times.
> It is too prone to human and web-browser errors for my taste,
> but it worked.
> Nevertheless, there are some caveats:
> - some downloaded files were incomplete (~10%) and you have to
> restart them
> - you can't submit several lists at the same time (otherwise the
> same cookie will be used and you'll end up with several identical
> files)
>
> -Tristan
>
> On Tuesday 29 January 2008 13:44:16 you wrote:
>> Forgot about that one; it's definitely a better way to do it if you
>> have the GI/accessions.
>>
>> chris
>>
>> On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:
>>> You don't need to use BioPerl to accomplish this task of downloading
>>> several thousand sequences based on an accession ID list.
>>>
>>> NCBI Batch Entrez can do that:
>>> http://www.ncbi.nlm.nih.gov/sites/batchentrez
>>>
>>> Just submit a large list of IDs, select the database, and download.
>>>
>>> You can usually submit ~50,000 IDs in one file without problems;
>>> it may not return results if the list is larger than ~100,000 IDs.
>>>
>>> --
>>> Alexander Kozik
>>> Bioinformatics Specialist
>>> Genome and Biomedical Sciences Facility
>>> 451 Health Sciences Drive
>>> Genome Center, 4-th floor, room 4302
>>> University of California
>>> Davis, CA 95616-8816
>>> Phone: (530) 754-9127
>>> email#1: akozik at atgc.org
>>> email#2: akozik at gmail.com
>>> web: http://www.atgc.org/
>>>
>>> Chris Fields wrote:
>>>> Yes, you can only retrieve ~500 sequences at a time using either
>>>> Bio::DB::GenBank or Bio::DB::EUtilities. Both modules interact
>>>> with NCBI's EUtilities (the former returns Bio::Seq/Bio::SeqIO
>>>> objects, the latter returns raw data from the URL to be processed
>>>> later).
>>>> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
>>>> You can usually post more IDs using epost and fetch sequences by
>>>> referring to the WebEnv/key combo (batch posting). I try to make
>>>> this a bit easier with EUtilities; it is woefully lacking in
>>>> documentation (my fault), but there is some code up on the wiki
>>>> which should work.
>>>> chris
>>>>
>>>> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
>>>>> Hello,
>>>>>
>>>>> I would like to download a large number of sequences from GenBank
>>>>> (122,146 to be exact) following a list of accession numbers.
>>>>> I first investigated around Bio::DB::EUtilities, but got lost and
>>>>> finally used Bio::DB::GenBank.
>>>>> My script works well for short requests, but it gives the
>>>>> following error with long requests:
>>>>>
>>>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>>>> MSG: WebDBSeqI Request Error:
>>>>> 500 short write
>>>>> Content-Type: text/plain
>>>>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>>>>> Client-Warning: Internal response
>>>>>
>>>>> 500 short write
>>>>>
>>>>> STACK: Error::throw
>>>>> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
>>>>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
>>>>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
>>>>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>>>>> STACK: ./fetch_from_genbank.pl:58
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Does that mean that we can only fetch 500 sequences at a time?
>>>>> Should I split my list into 500-id fragments and submit them one
>>>>> after the other?
>>>>>
>>>>> Any suggestions are very welcome...
>>>>> Thanks,
>>>>> -Tristan
>>>>>
>>>>>
>>>>> Here is the script:
>>>>>
>>>>> ##################################
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::DB::GenBank;
>>>>> # use Bio::DB::EUtilities;
>>>>> use Bio::SeqIO;
>>>>> use Getopt::Long;
>>>>>
>>>>> # 2008-01-22 T Lefebure
>>>>> # I tried to use Bio::DB::EUtilities without much success and
>>>>> # went back to Bio::DB::GenBank.
>>>>> # The following procedure is not really good, as the stream is
>>>>> # first copied to a temporary file and then re-used by BioPerl
>>>>> # to generate the final file.
>>>>>
>>>>> my $db = 'nucleotide';
>>>>> my $format = 'genbank';
>>>>> my $help= '';
>>>>> my $dformat = 'gb';
>>>>>
>>>>> GetOptions(
>>>>> 'help|?' => \$help,
>>>>> 'format=s' => \$format,
>>>>> 'database=s' => \$db,
>>>>> );
>>>>>
>>>>>
>>>>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc>
>>>>> <output>
>>>>>
>>>>> Will download the corresponding data from GenBank. BioPerl is
>>>>> required.
>>>>>
>>>>> Options:
>>>>> -h
>>>>> print this help
>>>>> -format: genbank|fasta|...
>>>>> give output format (default=genbank)
>>>>> -database: nucleotide|genome|protein|...
>>>>> define the database to search in (default=nucleotide)
>>>>>
>>>>> The full description of the options can be found at
>>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/
>>>>> efetchseq_help.html
>>>>> \n";
>>>>>
>>>>> if ($#ARGV<1) {
>>>>> print $printhelp;
>>>>> exit;
>>>>> }
>>>>>
>>>>> open LIST, $ARGV[0];
>>>>> my @list = <LIST>;
>>>>>
>>>>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>>>>
>>>>> my $gb = new Bio::DB::GenBank( -retrievaltype => 'tempfile',
>>>>> -format => $dformat,
>>>>> -db => $db,
>>>>> );
>>>>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>>>>
>>>>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>>>>> -format => $format,
>>>>> );
>>>>> while (my $seqo = $seqio->next_seq ) {
>>>>> print $seqo->id, "\n";
>>>>> $seqout->write_seq($seqo);
>>>>> }
>>>>
>>>> Christopher Fields
>>>> Postdoctoral Researcher
>>>> Lab of Dr. Robert Switzer
>>>> Dept of Biochemistry
>>>> University of Illinois Urbana-Champaign
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign