[Bioperl-l] Bio::DB::GenBank and large number of requests
Alexander Kozik
akozik at atgc.org
Tue Jan 29 18:32:41 UTC 2008
You don't need BioPerl for this task, i.e. downloading several
thousand sequences from a list of accession IDs.
NCBI Batch Entrez can do that:
http://www.ncbi.nlm.nih.gov/sites/batchentrez
Just submit a file with the list of IDs, select the database, and download.
You can usually submit ~50,000 IDs in one file without problems.
It may not return results if the list is larger than ~100,000 IDs.
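
Since the list in question (122,146 accessions) is above that size, it
would have to be split either way: into a few files for Batch Entrez, or
into small batches if you stay with the script quoted below. A minimal
sketch of the splitting step in plain Perl follows. The chunk_ids helper
and the accession strings are made up for illustration and are not part
of BioPerl; the batch size is whatever limit applies (~50,000 per Batch
Entrez file, or the ~500 per request mentioned in this thread, neither of
which is a documented hard limit as far as I know).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): clean a list of accession
# lines and split it into batches of at most $size IDs.
sub chunk_ids {
    my ($size, @lines) = @_;
    die "batch size must be a positive integer\n"
        unless defined $size && $size =~ /^[1-9]\d*$/;

    # Strip surrounding whitespace (including newlines) and drop empty lines.
    my @ids = grep { length }
              map  { my $id = $_; $id =~ s/^\s+|\s+$//g; $id } @lines;

    # Take $size IDs off the front of the list until none are left.
    my @batches;
    while (@ids) {
        push @batches, [ splice(@ids, 0, $size) ];
    }
    return @batches;
}

# Example with a tiny made-up list; real input would be read from the ID file.
my @batches = chunk_ids(2, "AB000001\n", "AB000002\n", "\n", "AB000003\n");
printf "batch %d: %s\n", $_ + 1, join(',', @{ $batches[$_] }) for 0 .. $#batches;
```

Each array reference returned this way could be written to its own file
for Batch Entrez, or passed to get_Stream_by_acc one batch at a time in
the quoted script, probably with a short pause between requests.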
--
Alexander Kozik
Bioinformatics Specialist
Genome and Biomedical Sciences Facility
451 Health Sciences Drive
Genome Center, 4-th floor, room 4302
University of California
Davis, CA 95616-8816
Phone: (530) 754-9127
email#1: akozik at atgc.org
email#2: akozik at gmail.com
web: http://www.atgc.org/
Chris Fields wrote:
> Yes, you can only retrieve ~500 sequences at a time using either
> Bio::DB::GenBank or Bio::DB::EUtilities. Both modules interact with
> NCBI's EUtilities (Bio::DB::GenBank returns Bio::Seq/Bio::SeqIO
> objects, while Bio::DB::EUtilities returns raw data from the URL to be
> processed later).
>
> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
>
>
> You can usually post more IDs using epost and then fetch the sequences
> by referring to the WebEnv/key combo (batch posting). I try to make this
> a bit easier with EUtilities. It is woefully lacking in documentation (my
> fault), but there is some code up on the wiki which should work.
>
> chris
>
> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
>
>> Hello,
>>
>> I would like to download a large number of sequences from GenBank
>> (122,146 to be exact) following a list of accession numbers.
>> I first investigated around Bio::DB::EUtilities, but got lost and
>> finally used Bio::DB::GenBank.
>> My script works well for short requests, but it gives the following
>> error with the long request:
>>
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: WebDBSeqI Request Error:
>> 500 short write
>> Content-Type: text/plain
>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>> Client-Warning: Internal response
>>
>> 500 short write
>>
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw
>> /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
>> STACK: Bio::DB::WebDBSeqI::_request
>> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
>> STACK: Bio::DB::WebDBSeqI::get_seq_stream
>> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc
>> /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>> STACK: ./fetch_from_genbank.pl:58
>> ---------------------------------------------------------
>>
>> Does that mean that we can only fetch 500 sequences at a time?
>> Should I split my list into 500-ID fragments and submit them one after
>> the other?
>>
>> Any suggestions are very welcome...
>> Thanks,
>> -Tristan
>>
>>
>> Here is the script:
>>
>> ##################################
>> use strict;
>> use warnings;
>> use Bio::DB::GenBank;
>> # use Bio::DB::EUtilities;
>> use Bio::SeqIO;
>> use Getopt::Long;
>>
>> # 2008-01-22 T Lefebure
>> # I tried to use Bio::DB::EUtilities without much success and went back
>> # to Bio::DB::GenBank.
>> # The following procedure is not really good, as the stream is first
>> # copied to a temporary file
>> # and then re-used by BioPerl to generate the final file.
>>
>> my $db = 'nucleotide';
>> my $format = 'genbank';
>> my $help= '';
>> my $dformat = 'gb';
>>
>> GetOptions(
>> 'help|?' => \$help,
>> 'format=s' => \$format,
>> 'database=s' => \$db,
>> );
>>
>>
>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>>
>> Will download the corresponding data from GenBank. BioPerl is required.
>>
>> Options:
>> -h
>> print this help
>> -format: genbank|fasta|...
>> give output format (default=genbank)
>> -database: nucleotide|genome|protein|...
>> define the database to search in (default=nucleotide)
>>
>> The full description of the options can be found at
>> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html\n";
>>
>> if ($help || @ARGV < 2) {
>> print $printhelp;
>> exit;
>> }
>>
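>> # Read the accession list, failing loudly if it cannot be opened
>> open my $list_fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
>> chomp(my @list = <$list_fh>);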
>>
>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>
>> my $gb = Bio::DB::GenBank->new( -retrievaltype => 'tempfile',
>> -format => $dformat,
>> -db => $db,
>> );
>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>
>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>> -format => $format,
>> );
>> while (my $seqo = $seqio->next_seq ) {
>> print $seqo->id, "\n";
>> $seqout->write_seq($seqo);
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>