[Bioperl-l] Bio::DB::GenBank and large number of requests

Tue Jan 29 18:44:16 UTC 2008

Forgot about that one; it's definitely a better way to do it if you  
have the GI/accessions.

chris

On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:

> you don't need to use bioperl to accomplish this task, to download  
> several thousand sequences based on accession ID list.
>
> NCBI batch Entrez can do that:
> http://www.ncbi.nlm.nih.gov/sites/batchentrez
>
> just submit a large list of IDs, select database, and download.
>
> you can submit ~50,000 IDs in one file usually without problems.
> it may not return results if a list is larger than ~100,000 IDs
>
> --
> Alexander Kozik
> Bioinformatics Specialist
> Genome and Biomedical Sciences Facility
> 451 Health Sciences Drive
> Genome Center, 4-th floor, room 4302
> University of California
> Davis, CA 95616-8816
> Phone: (530) 754-9127
> email#1: akozik at atgc.org
> email#2: akozik at gmail.com
> web: http://www.atgc.org/
>
>
>
> Chris Fields wrote:
>> Yes, you can only retrieve ~500 sequences at a time using either  
>> Bio::DB::GenBank.  Both Bio::DB::GenBank and Bio::DB::EUtilities  
>> interact with NCBI's EUtilities (the former module returns raw data  
>> from the URL to be processed later, the latter module returns  
>> Bio::Seq/Bio::SeqIO objects).
>> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets 
>>  You can usually post more IDs using epost and fetch sequence  
>> referring to the WebEnv/key combo (batch posting).  I try to make  
>> this a bit easier with EUtilities but it is woefully lacking in  
>> documentation (my fault), but there is some code up on the wiki  
>> which should work.
>> chris
>> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
>>> Hello,
>>>
>>> I would like to download a large number of sequences from GenBank  
>>> (122,146 to be exact) following a list of accession numbers.
>>> I first investigated around Bio::DB::EUtilities, but got lost and  
>>> finally used Bio::DB::GenBank.
>>> My script works well for short request, but it gives the following  
>>> error with the long request:
>>>
>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>> MSG: WebDBSeqI Request Error:
>>> 500 short write
>>> Content-Type: text/plain
>>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>>> Client-Warning: Internal response
>>>
>>> 500 short write
>>>
>>> STACK: Error::throw
>>> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/ 
>>> Root.pm:359
>>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/ 
>>> Bio/DB/WebDBSeqI.pm:685
>>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/ 
>>> 5.8.8/Bio/DB/WebDBSeqI.pm:472
>>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/ 
>>> perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>>> STACK: ./fetch_from_genbank.pl:58
>>> ---------------------------------------------------------
>>>
>>> Does that mean that we can only fetch 500 sequences at a time?
>>> Should I split my list in 500 ids framents and submit them one  
>>> after the other?
>>>
>>> Any suggestions very welcomed...
>>> Thanks,
>>> -Tristan
>>>
>>>
>>> Here is the script:
>>>
>>> ##################################
>>> use strict;
>>> use warnings;
>>> use Bio::DB::GenBank;
>>> # use Bio::DB::EUtilities;
>>> use Bio::SeqIO;
>>> use Getopt::Long;
>>>
>>> # 2008-01-22 T Lefebure
>>> # I tried to use Bio::DB::EUtilities without much succes and get  
>>> back to Bio::DB::GenBank.
>>> # The following procedure is not really good as the stream is  
>>> first copied to a temporary file,
>>> # and than re-used by BioPerl to generate the final file.
>>>
>>> my $db = 'nucleotide';
>>> my $format = 'genbank';
>>> my $help= '';
>>> my $dformat = 'gb';
>>>
>>> GetOptions(
>>>    'help|?' => \$help,
>>>    'format=s'  => \$format,
>>>    'database=s'    => \$db,
>>> );
>>>
>>>
>>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>>>
>>> Will download the corresponding data from GenBank. BioPerl is  
>>> required.
>>>
>>> Options:
>>>    -h
>>>        print this help
>>>    -format: genbank|fasta|...
>>>        give output format (default=genbank)
>>>    -database: nucleotide|genome|protein|...
>>>        define the database to search in (default=nucleotide)
>>>
>>> The full description of the options can be find at http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html 
>>> \n";
>>>
>>> if ($#ARGV<1) {
>>>    print $printhelp;
>>>    exit;
>>> }
>>>
>>> open LIST, $ARGV[0];
>>> my @list = <LIST>;
>>>
>>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>>
>>> my $gb = new Bio::DB::GenBank(    -retrievaltype => 'tempfile',
>>>                -format => $dformat,
>>>                -db => $db,
>>>            );
>>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>>
>>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>>>                -format => $format,
>>>            );
>>> while (my $seqo = $seqio->next_seq ) {
>>>    print $seqo->id, "\n";
>>>    $seqout->write_seq($seqo);
>>> }
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign