[Bioperl-l] Bio::DB::GenBank and large number of requests
Tristan Lefebure
tristan.lefebure at gmail.com
Tue Jan 29 17:19:23 UTC 2008
Hello,
I would like to download a large number of sequences from GenBank (122,146 to be exact) following a list of accession numbers.
I first investigated around Bio::DB::EUtilities, but got lost and finally used Bio::DB::GenBank.
My script works well for short request, but it gives the following error with the long request:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: WebDBSeqI Request Error:
500 short write
Content-Type: text/plain
Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
Client-Warning: Internal response
500 short write
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
STACK: ./fetch_from_genbank.pl:58
---------------------------------------------------------
Does that mean that we can only fetch 500 sequences at a time?
Should I split my list in 500 ids framents and submit them one after the other?
Any suggestions very welcomed...
Thanks,
-Tristan
Here is the script:
##################################
use strict;
use warnings;
use Bio::DB::GenBank;
# use Bio::DB::EUtilities;
use Bio::SeqIO;
use Getopt::Long;
# 2008-01-22 T Lefebure
# I tried to use Bio::DB::EUtilities without much succes and get back to Bio::DB::GenBank.
# The following procedure is not really good as the stream is first copied to a temporary file,
# and than re-used by BioPerl to generate the final file.
my $db = 'nucleotide';
my $format = 'genbank';
my $help= '';
my $dformat = 'gb';
GetOptions(
'help|?' => \$help,
'format=s' => \$format,
'database=s' => \$db,
);
my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
Will download the corresponding data from GenBank. BioPerl is required.
Options:
-h
print this help
-format: genbank|fasta|...
give output format (default=genbank)
-database: nucleotide|genome|protein|...
define the database to search in (default=nucleotide)
The full description of the options can be find at http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html\n";
if ($#ARGV<1) {
print $printhelp;
exit;
}
open LIST, $ARGV[0];
my @list = <LIST>;
if ($format eq 'fasta') { $dformat = 'fasta' }
my $gb = new Bio::DB::GenBank( -retrievaltype => 'tempfile',
-format => $dformat,
-db => $db,
);
my $seqio = $gb->get_Stream_by_acc(\@list);
my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
-format => $format,
);
while (my $seqo = $seqio->next_seq ) {
print $seqo->id, "\n";
$seqout->write_seq($seqo);
}
More information about the Bioperl-l
mailing list