[Bioperl-l] Bio::DB::GenBank and large number of requests
Chris Fields
cjfields at uiuc.edu
Tue Jan 29 18:06:08 UTC 2008
Yes, you can only retrieve ~500 sequences at a time using either
Bio::DB::EUtilities or Bio::DB::GenBank. Both interact with NCBI's
EUtilities (the former module returns raw data from the URL to be
processed later, the latter module returns Bio::Seq/Bio::SeqIO
objects).
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
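
Given the ~500-per-request ceiling, one option is to split the accession list into batches and fetch them one after the other with Bio::DB::GenBank. What follows is only a minimal sketch of that idea: the helper names chunk_ids and fetch_in_batches, the batch size of 500, and the 3-second pause are assumptions for illustration, not settings taken from BioPerl or NCBI.

```perl
use strict;
use warnings;

# Split a list of IDs into batches of at most $size elements.
# (chunk_ids is a hypothetical helper, not part of BioPerl.)
sub chunk_ids {
    my ($size, @ids) = @_;
    die "batch size must be positive\n" if $size < 1;
    my @batches;
    push @batches, [ splice @ids, 0, $size ] while @ids;
    return @batches;
}

# Fetch each batch with Bio::DB::GenBank and write all records to one file.
sub fetch_in_batches {
    my ($list_file, $out_file, $batch_size) = @_;

    # Loaded at run time so chunk_ids() can be tried without BioPerl installed.
    require Bio::DB::GenBank;
    require Bio::SeqIO;

    open my $in, '<', $list_file or die "Cannot open $list_file: $!\n";
    chomp(my @ids = <$in>);
    close $in;
    @ids = grep { length } @ids;    # drop blank lines

    my $gb  = Bio::DB::GenBank->new(-retrievaltype => 'tempfile',
                                    -format        => 'gb');
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'genbank');

    for my $batch (chunk_ids($batch_size, @ids)) {
        my $stream = $gb->get_Stream_by_acc($batch);
        while (my $seq = $stream->next_seq) {
            $out->write_seq($seq);
        }
        sleep 3;    # assumed pause between requests, to go easy on NCBI
    }
}

# Usage: perl fetch_batches.pl <list of acc> <output>
fetch_in_batches(@ARGV, 500) if @ARGV == 2;
```

A failed batch still aborts the whole run here; wrapping the get_Stream_by_acc call in an eval and retrying would be a natural next step.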
You can usually post more IDs with epost and then fetch the sequences
by referring to the returned WebEnv/key combo (batch posting). I have
tried to make this a bit easier with EUtilities; it is woefully lacking
in documentation (my fault), but there is some code up on the wiki
which should work.
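
To make the WebEnv/key idea concrete, here is a rough sketch at the level of NCBI's raw E-utilities URLs rather than the Bio::DB::EUtilities interface. It only parses an epost reply and builds the matching efetch URLs; it sends nothing over the network. The helper names parse_epost and efetch_urls are hypothetical, and the sketch assumes the WebEnv value needs no further URL escaping.

```perl
use strict;
use warnings;

# Base URL of NCBI's E-utilities service.
my $BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Pull WebEnv and QueryKey out of the XML that epost returns.
# (parse_epost is a hypothetical helper; a real script might use an XML parser.)
sub parse_epost {
    my ($xml) = @_;
    my ($key)    = $xml =~ m{<QueryKey>(\d+)</QueryKey>};
    my ($webenv) = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    die "epost reply lacks WebEnv/QueryKey\n"
        unless defined $key && defined $webenv;
    return ($webenv, $key);
}

# Build one efetch URL per window of $retmax records, all pointing at the
# same posted ID set through the WebEnv/query_key pair.
sub efetch_urls {
    my ($db, $webenv, $key, $total, $retmax) = @_;
    my @urls;
    for (my $start = 0; $start < $total; $start += $retmax) {
        push @urls,
            "$BASE/efetch.fcgi?db=$db&query_key=$key&WebEnv=$webenv"
          . "&rettype=gb&retmode=text&retstart=$start&retmax=$retmax";
    }
    return @urls;
}
```

Each URL can then be retrieved in turn (for example with LWP::UserAgent) and the returned GenBank text appended to one file for Bio::SeqIO to read.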
chris
On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
> Hello,
>
> I would like to download a large number of sequences from GenBank
> (122,146 to be exact) following a list of accession numbers.
> I first investigated around Bio::DB::EUtilities, but got lost and
> finally used Bio::DB::GenBank.
> My script works well for short requests, but it gives the following
> error with a long request:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> 500 short write
> Content-Type: text/plain
> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
> Client-Warning: Internal response
>
> 500 short write
>
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/
> Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/
> DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/
> 5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/
> 5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:58
> ---------------------------------------------------------
>
> Does that mean that we can only fetch 500 sequences at a time?
> Should I split my list into fragments of 500 IDs and submit them one
> after the other?
>
> Any suggestions are very welcome...
> Thanks,
> -Tristan
>
>
> Here is the script:
>
> ##################################
> use strict;
> use warnings;
> use Bio::DB::GenBank;
> # use Bio::DB::EUtilities;
> use Bio::SeqIO;
> use Getopt::Long;
>
> # 2008-01-22 T Lefebure
> # I tried to use Bio::DB::EUtilities without much success and went
> # back to Bio::DB::GenBank.
> # The following procedure is not ideal, as the stream is first
> # copied to a temporary file
> # and then reused by BioPerl to generate the final file.
>
> my $db = 'nucleotide';
> my $format = 'genbank';
> my $help= '';
> my $dformat = 'gb';
>
> GetOptions(
> 'help|?' => \$help,
> 'format=s' => \$format,
> 'database=s' => \$db,
> );
>
>
> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>
> Will download the corresponding data from GenBank. BioPerl is
> required.
>
> Options:
> -h
> print this help
> -format: genbank|fasta|...
> give output format (default=genbank)
> -database: nucleotide|genome|protein|...
> define the database to search in (default=nucleotide)
>
> The full description of the options can be found at http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
> \n";
>
> if ($#ARGV<1) {
> print $printhelp;
> exit;
> }
>
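> # Fail loudly if the list cannot be read, and strip trailing newlines
> open(LIST, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
> chomp(my @list = <LIST>);
> close LIST;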
>
> if ($format eq 'fasta') { $dformat = 'fasta' }
>
> my $gb = Bio::DB::GenBank->new( -retrievaltype => 'tempfile',
> -format => $dformat,
> -db => $db,
> );
> my $seqio = $gb->get_Stream_by_acc(\@list);
>
> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
> -format => $format,
> );
> while (my $seqo = $seqio->next_seq ) {
> print $seqo->id, "\n";
> $seqout->write_seq($seqo);
> }
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign