[Bioperl-l] Bio::DB::GenBank and large number of requests

Wed Jan 30 19:34:44 UTC 2008

Take a look at
http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
Ebot is an interactive tool that generates a Perl script that implements
an E-utility pipeline.
You can probably hack the resulting script to introduce the required
BioPerly bits.

Russell Smithies 

Bioinformatics Software Developer 
T +64 3 489 9085 
E  russell.smithies at agresearch.co.nz 

Invermay  Research Centre 
Puddle Alley, 
Mosgiel, 
New Zealand 
T  +64 3 489 3809   
F  +64 3 489 9174  
www.agresearch.co.nz

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-
> bio.org] On Behalf Of Tristan Lefebure
> Sent: Thursday, 31 January 2008 3:56 a.m.
> To: Bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::DB::GenBank and large number of requests
> 
> Thank you both!
> 
> Just in case it might be usefull for someone else, here are my
ramblings:
> 
> 1. I first tried to adapt my script and fetch 500 sequences at a time.
It works,
> except that ~40% of the time NCBI gives the following error and my
script crashed:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> [...]
>     The proxy server received an invalid
>     response from an upstream server.
> [...]
> STACK: Error::throw
> STACK: Bio::Root::Root::throw
/usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request
> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream
> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc
> /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:68
> -----------------------------------------------------------
> 
> I tried to modify the script so that when the retrieval of a 500
sequence block
> crashes, it continues with the other blocks, but I was unsuccessfull.
It probably
> needs some better understanding of BioPerl errors...
> Here is the section of the script that was modified:
> #########
> my $n_seq = scalar @list;
> my @aborted;
> 
> for (my $i=1; $i<=$n_seq; $i += 500) {
> 	print "Fetching sequences $i to ", $i+499, ": ";
> 	my $start = $i -1;
> 	my $end = $i + 500 -1;
> 	my @red_list = @list[$start .. $end];
> 	my $gb = new Bio::DB::GenBank(	-retrievaltype => 'tempfile',
> 					-format => $dformat,
> 					-db => $db,
> 				);
> 
> 	my $seqio;
> 	unless(	$seqio = $gb->get_Stream_by_acc(\@red_list)) {
> 		print "Aborted, resubmit latter\n";
> 		push @aborted, @red_list;
> 		next;
> 	}
> 
> 	my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1].$i",
> 					-format => $format,
> 				);
> 	while (my $seqo = $seqio->next_seq ) {
> # 		print $seqo->id, "\n";
> 		$seqout->write_seq($seqo);
> 	}
> 	print "Done\n";
> }
> 
> if (@aborted) {
> 	open OUT, ">aborted_fetching.AN";
> 	foreach (@aborted) { print OUT $_ };
> }
> ##########
> 
> 
> 2. So I moved to the second solution and tried batchentrez. I cut my
120,000 long
> AN list into 10,000 long pieces using split:
> split -l 10000 full_list.AN splitted_list_
> 
> and then submitted the 13 lists one by one. I must say that I don't
really like using
> a web-interface to fetch data, and here the most ennoying part is that
you end up
> with a regular Entrez/GenBank webpage: select your format, export to
file, chosse
> file name... and have to do it many times.
> It is too much prone to human and web-browser errors for my taste, but
it worked.
> Nevertheless there is some caveats:
> - some downloaded files were incomplete (~10%) and you have to restart
it
> - you can't submit several lists in the same time (otherwise the same
cookie will be
> used and you'll end up with several identical files)
> 
> -Tristan
> 
> On Tuesday 29 January 2008 13:44:16 you wrote:
> > Forgot about that one; it's definitely a better way to do it if you
> > have the GI/accessions.
> >
> > chris
> >
> > On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:
> > > you don't need to use bioperl to accomplish this task, to download
> > > several thousand sequences based on accession ID list.
> > >
> > > NCBI batch Entrez can do that:
> > > http://www.ncbi.nlm.nih.gov/sites/batchentrez
> > >
> > > just submit a large list of IDs, select database, and download.
> > >
> > > you can submit ~50,000 IDs in one file usually without problems.
> > > it may not return results if a list is larger than ~100,000 IDs
> > >
> > > --
> > > Alexander Kozik
> > > Bioinformatics Specialist
> > > Genome and Biomedical Sciences Facility
> > > 451 Health Sciences Drive
> > > Genome Center, 4-th floor, room 4302
> > > University of California
> > > Davis, CA 95616-8816
> > > Phone: (530) 754-9127
> > > email#1: akozik at atgc.org
> > > email#2: akozik at gmail.com
> > > web: http://www.atgc.org/
> > >
> > > Chris Fields wrote:
> > >> Yes, you can only retrieve ~500 sequences at a time using either
> > >> Bio::DB::GenBank.  Both Bio::DB::GenBank and Bio::DB::EUtilities
> > >> interact with NCBI's EUtilities (the former module returns raw
data
> > >> from the URL to be processed later, the latter module returns
> > >> Bio::Seq/Bio::SeqIO objects).
> > >>
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-d
> > >>atasets You can usually post more IDs using epost and fetch
sequence
> > >> referring to the WebEnv/key combo (batch posting).  I try to make
> > >> this a bit easier with EUtilities but it is woefully lacking in
> > >> documentation (my fault), but there is some code up on the wiki
> > >> which should work.
> > >> chris
> > >>
> > >> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
> > >>> Hello,
> > >>>
> > >>> I would like to download a large number of sequences from
GenBank
> > >>> (122,146 to be exact) following a list of accession numbers.
> > >>> I first investigated around Bio::DB::EUtilities, but got lost
and
> > >>> finally used Bio::DB::GenBank.
> > >>> My script works well for short request, but it gives the
following
> > >>> error with the long request:
> > >>>
> > >>> ------------- EXCEPTION: Bio::Root::Exception -------------
> > >>> MSG: WebDBSeqI Request Error:
> > >>> 500 short write
> > >>> Content-Type: text/plain
> > >>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
> > >>> Client-Warning: Internal response
> > >>>
> > >>> 500 short write
> > >>>
> > >>> STACK: Error::throw
> > >>> STACK: Bio::Root::Root::throw
/usr/local/share/perl/5.8.8/Bio/Root/
> > >>> Root.pm:359
> > >>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/
> > >>> Bio/DB/WebDBSeqI.pm:685
> > >>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/
> > >>> 5.8.8/Bio/DB/WebDBSeqI.pm:472
> > >>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/
> > >>> perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> > >>> STACK: ./fetch_from_genbank.pl:58
> > >>> ---------------------------------------------------------
> > >>>
> > >>> Does that mean that we can only fetch 500 sequences at a time?
> > >>> Should I split my list in 500 ids framents and submit them one
> > >>> after the other?
> > >>>
> > >>> Any suggestions very welcomed...
> > >>> Thanks,
> > >>> -Tristan
> > >>>
> > >>>
> > >>> Here is the script:
> > >>>
> > >>> ##################################
> > >>> use strict;
> > >>> use warnings;
> > >>> use Bio::DB::GenBank;
> > >>> # use Bio::DB::EUtilities;
> > >>> use Bio::SeqIO;
> > >>> use Getopt::Long;
> > >>>
> > >>> # 2008-01-22 T Lefebure
> > >>> # I tried to use Bio::DB::EUtilities without much succes and get
> > >>> back to Bio::DB::GenBank.
> > >>> # The following procedure is not really good as the stream is
> > >>> first copied to a temporary file,
> > >>> # and than re-used by BioPerl to generate the final file.
> > >>>
> > >>> my $db = 'nucleotide';
> > >>> my $format = 'genbank';
> > >>> my $help= '';
> > >>> my $dformat = 'gb';
> > >>>
> > >>> GetOptions(
> > >>>    'help|?' => \$help,
> > >>>    'format=s'  => \$format,
> > >>>    'database=s'    => \$db,
> > >>> );
> > >>>
> > >>>
> > >>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc>
<output>
> > >>>
> > >>> Will download the corresponding data from GenBank. BioPerl is
> > >>> required.
> > >>>
> > >>> Options:
> > >>>    -h
> > >>>        print this help
> > >>>    -format: genbank|fasta|...
> > >>>        give output format (default=genbank)
> > >>>    -database: nucleotide|genome|protein|...
> > >>>        define the database to search in (default=nucleotide)
> > >>>
> > >>> The full description of the options can be find at
> > >>>
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
> > >>> \n";
> > >>>
> > >>> if ($#ARGV<1) {
> > >>>    print $printhelp;
> > >>>    exit;
> > >>> }
> > >>>
> > >>> open LIST, $ARGV[0];
> > >>> my @list = <LIST>;
> > >>>
> > >>> if ($format eq 'fasta') { $dformat = 'fasta' }
> > >>>
> > >>> my $gb = new Bio::DB::GenBank(    -retrievaltype => 'tempfile',
> > >>>                -format => $dformat,
> > >>>                -db => $db,
> > >>>            );
> > >>> my $seqio = $gb->get_Stream_by_acc(\@list);
> > >>>
> > >>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
> > >>>                -format => $format,
> > >>>            );
> > >>> while (my $seqo = $seqio->next_seq ) {
> > >>>    print $seqo->id, "\n";
> > >>>    $seqout->write_seq($seqo);
> > >>> }
> > >>> _______________________________________________
> > >>> Bioperl-l mailing list
> > >>> Bioperl-l at lists.open-bio.org
> > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >>
> > >> Christopher Fields
> > >> Postdoctoral Researcher
> > >> Lab of Dr. Robert Switzer
> > >> Dept of Biochemistry
> > >> University of Illinois Urbana-Champaign
> > >> _______________________________________________
> > >> Bioperl-l mailing list
> > >> Bioperl-l at lists.open-bio.org
> > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================