[Bioperl-l] Bio::DB::GenBank and large number of requests
Georg Otto
georg.otto at tuebingen.mpg.de
Thu Jan 31 09:34:31 UTC 2008
Hi,
I succeeded with a similar task using the SeqHound database. I had a
list of > 200,000 GI numbers, but I guess it can work in a similar
fashion using accession numbers. Here is the script:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::SeqHound;

my $sh = Bio::DB::SeqHound->new();

my $USAGE = "$0 id_file\n\n";
unless (@ARGV) {
    print $USAGE;
    exit;
}

my $id_file = $ARGV[0];
open my $id_fh, '<', $id_file or die "error: $!";

# one FASTA stream to STDOUT for all sequences
my $out = Bio::SeqIO->new(-format => 'fasta');

while (my $id = <$id_fh>) {
    chomp $id;
    # skip IDs that SeqHound cannot resolve
    if (defined(my $seq_obj = $sh->get_Seq_by_gi($id))) {
        $out->write_seq($seq_obj);
    }
}
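
For a list of accession numbers rather than GIs, the same loop should work by swapping the lookup call; an untested sketch of just the changed lines (assuming SeqHound's get_Seq_by_acc behaves like get_Seq_by_gi here):

    # untested: look the entry up by accession instead of GI
    if (defined(my $seq_obj = $sh->get_Seq_by_acc($id))) {
        $out->write_seq($seq_obj);
    }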
Best,
Georg
Tristan Lefebure <tristan.lefebure at gmail.com> writes:
> Hello,
>
> I would like to download a large number of sequences from GenBank (122,146 to be exact) from a list of accession numbers.
> I first looked into Bio::DB::EUtilities, but got lost and ended up using Bio::DB::GenBank.
> My script works well for short requests, but it gives the following error with this long one:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> 500 short write
> Content-Type: text/plain
> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
> Client-Warning: Internal response
>
> 500 short write
>
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:58
> ---------------------------------------------------------
>
> Does that mean that we can only fetch 500 sequences at a time?
> Should I split my list into fragments of 500 IDs and submit them one after the other?
>
> Any suggestions are very welcome...
> Thanks,
> -Tristan
>
>
> Here is the script:
>
> ##################################
> use strict;
> use warnings;
> use Bio::DB::GenBank;
> # use Bio::DB::EUtilities;
> use Bio::SeqIO;
> use Getopt::Long;
>
> # 2008-01-22 T Lefebure
> # I tried to use Bio::DB::EUtilities without much success and went back to Bio::DB::GenBank.
> # The following procedure is not ideal, as the stream is first copied to a temporary file
> # and then re-used by BioPerl to generate the final file.
>
> my $db = 'nucleotide';
> my $format = 'genbank';
> my $help= '';
> my $dformat = 'gb';
>
> GetOptions(
>     'help|?'     => \$help,
>     'format=s'   => \$format,
>     'database=s' => \$db,
> );
>
>
> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>
> Will download the corresponding data from GenBank. BioPerl is required.
>
> Options:
>     -h
>         print this help
>     -format: genbank|fasta|...
>         give output format (default=genbank)
>     -database: nucleotide|genome|protein|...
>         define the database to search in (default=nucleotide)
>
> The full description of the options can be found at http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html\n";
>
> if ($#ARGV<1) {
> print $printhelp;
> exit;
> }
>
> open LIST, '<', $ARGV[0] or die "error: $!";
> my @list = <LIST>;
> chomp @list;
>
> if ($format eq 'fasta') { $dformat = 'fasta' }
>
> my $gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile',
>                                -format        => $dformat,
>                                -db            => $db,
>                                );
> my $seqio = $gb->get_Stream_by_acc(\@list);
>
> my $seqout = Bio::SeqIO->new(-file   => ">$ARGV[1]",
>                              -format => $format,
>                              );
> while (my $seqo = $seqio->next_seq) {
>     print $seqo->id, "\n";
>     $seqout->write_seq($seqo);
> }
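
To the question above: the 500 in the error message looks like the HTTP status code of the failed request, not a limit of 500 sequences. Still, splitting the accession list into smaller batches and fetching them one after the other is a reasonable workaround; a rough sketch along the lines of the script above (the batch size of 200 and the pause between batches are arbitrary choices, not documented NCBI or BioPerl limits):

#!/usr/bin/perl
# Sketch only: fetch the accessions in batches instead of one huge request.
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

open my $list_fh, '<', $ARGV[0] or die "error: $!";
chomp(my @acc = <$list_fh>);
close $list_fh;

my $gb  = Bio::DB::GenBank->new(-retrievaltype => 'tempfile',
                                -format        => 'gb');
my $out = Bio::SeqIO->new(-file => ">$ARGV[1]", -format => 'genbank');

while (my @batch = splice(@acc, 0, 200)) {
    my $stream = $gb->get_Stream_by_acc(\@batch);
    while (my $seq = $stream->next_seq) {
        $out->write_seq($seq);
    }
    sleep 3;    # space out requests so as not to hammer NCBI
}

For ID lists as large as this one, NCBI's own recommendation is the EUtilities route (epost the IDs, then efetch them in batches), which is what Bio::DB::EUtilities wraps.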