[Bioperl-l] Bio::DB::GenBank and large number of requests

Tristan Lefebure tristan.lefebure at gmail.com
Wed Jan 30 14:56:07 UTC 2008


Thank you both!

Just in case it might be useful for someone else, here are my ramblings:

1. I first tried to adapt my script to fetch 500 sequences at a time. It works, except that ~40% of the time NCBI returns the following error and my script crashes:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: WebDBSeqI Request Error:
[...]
    The proxy server received an invalid
    response from an upstream server.
[...]
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
STACK: ./fetch_from_genbank.pl:68
-----------------------------------------------------------

I tried to modify the script so that when the retrieval of a 500-sequence block fails it skips the block and continues with the remaining ones, but I was unsuccessful. Presumably the problem is that get_Stream_by_acc() throws a Bio::Root::Exception (see the stack trace above) rather than returning undef, so the unless() test below never fires; trapping the exception takes a better understanding of BioPerl's error handling than I had (one eval-based way is sketched after the code).
Here is the section of the script that was modified:
#########
my $n_seq = scalar @list;
my @aborted;

for (my $i = 1; $i <= $n_seq; $i += 500) {
	my $start = $i - 1;                       # 0-based index of the first sequence in this block
	my $end   = $start + 499;                 # 500 sequences per block...
	$end = $n_seq - 1 if $end > $n_seq - 1;   # ...clamped for the final, shorter block
	print "Fetching sequences $i to ", $end + 1, ": ";
	my @red_list = @list[$start .. $end];
	my $gb = Bio::DB::GenBank->new(
		-retrievaltype => 'tempfile',
		-format        => $dformat,
		-db            => $db,
	);

	my $seqio;
	# NOTE: this test never fires -- get_Stream_by_acc() throws an
	# exception on failure instead of returning undef (see above).
	unless ($seqio = $gb->get_Stream_by_acc(\@red_list)) {
		print "Aborted, will resubmit later\n";
		push @aborted, @red_list;
		next;
	}
	
	my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1].$i",
					-format => $format,
				);
	while (my $seqo = $seqio->next_seq ) {
# 		print $seqo->id, "\n";
		$seqout->write_seq($seqo);
	}
	print "Done\n";
}

if (@aborted) {
	open OUT, '>', 'aborted_fetching.AN' or die "Cannot write aborted_fetching.AN: $!";
	print OUT $_ foreach @aborted;   # entries keep their newlines from <LIST>
	close OUT;
}
##########
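For completeness, here is the eval-based trap I was missing, as a minimal sketch that would replace the fetch loop above (untested at scale; @list, $dformat, $db, $format and $ARGV[1] are assumed to be set as in the script). A plain eval {} catches what Bio::Root::Root::throw raises and leaves the message in $@:

#########
my @aborted;
for (my $i = 1; $i <= @list; $i += 500) {
	my $start = $i - 1;
	my $end   = $start + 499;
	$end = $#list if $end > $#list;
	my @block = @list[$start .. $end];

	my $gb = Bio::DB::GenBank->new(
		-retrievaltype => 'tempfile',
		-format        => $dformat,
		-db            => $db,
	);

	# get_Stream_by_acc() dies on a server error; eval{} turns that
	# death into a defined $@ we can test instead of crashing.
	my $seqio = eval { $gb->get_Stream_by_acc(\@block) };
	if ($@ or not defined $seqio) {
		warn "Block starting at $i failed, will resubmit later: $@";
		push @aborted, @block;
		sleep 5;   # be gentle with NCBI before the next request
		next;
	}

	my $seqout = Bio::SeqIO->new(
		-file   => ">$ARGV[1].$i",
		-format => $format,
	);
	# next_seq() can also die on a truncated stream, so trap it too
	eval {
		while (my $seq = $seqio->next_seq) {
			$seqout->write_seq($seq);
		}
	};
	if ($@) {
		warn "Stream for block $i broke mid-way: $@";
		push @aborted, @block;
	}
}
# @aborted can then be written out and resubmitted, as above.
#########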


2. So I moved to the second solution and tried Batch Entrez. I cut my 122,146-line AN list into 10,000-line pieces using split:
split -l 10000 full_list.AN splitted_list_

and then submitted the 13 lists one by one. I must say that I don't really like using a web interface to fetch data, and the most annoying part here is that you end up on a regular Entrez/GenBank web page each time: select your format, export to file, choose a file name... and you have to do it 13 times.
It is too prone to human and web-browser error for my taste, but it worked.
Nevertheless, there are some caveats:
- some downloaded files (~10%) were incomplete and had to be fetched again (a quick completeness check is sketched below)
- you can't submit several lists at the same time (otherwise the same cookie is used and you end up with several identical files)
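Checking completeness by eye is tedious. A quick sanity check (a sketch only, assuming GenBank-format downloads, where every record ends with a '//' line; the file names are hypothetical) is to compare the number of record terminators against the number of accessions submitted:

#########
use strict;
use warnings;

# Placeholder names for one split list and its Batch Entrez download.
my ($list_file, $gb_file) = ('splitted_list_aa', 'batch_aa.gb');

open my $list, '<', $list_file or die "Cannot read $list_file: $!";
my $expected = grep { /\S/ } <$list>;    # non-blank accession lines
close $list;

open my $gb, '<', $gb_file or die "Cannot read $gb_file: $!";
my $got = grep { m{^//} } <$gb>;         # '//' ends each GenBank record
close $gb;

print "$gb_file: $got of $expected records",
	$got == $expected ? "\n" : " -- INCOMPLETE, fetch it again!\n";
#########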

-Tristan

On Tuesday 29 January 2008 13:44:16 you wrote:
> Forgot about that one; it's definitely a better way to do it if you
> have the GI/accessions.
>
> chris
>
> On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:
> > you don't need to use bioperl to accomplish this task, i.e. to
> > download several thousand sequences based on an accession ID list.
> >
> > NCBI batch Entrez can do that:
> > http://www.ncbi.nlm.nih.gov/sites/batchentrez
> >
> > just submit a large list of IDs, select a database, and download.
> >
> > you can usually submit ~50,000 IDs in one file without problems;
> > it may not return results if the list is larger than ~100,000 IDs.
> >
> > --
> > Alexander Kozik
> > Bioinformatics Specialist
> > Genome and Biomedical Sciences Facility
> > 451 Health Sciences Drive
> > Genome Center, 4th floor, room 4302
> > University of California
> > Davis, CA 95616-8816
> > Phone: (530) 754-9127
> > email#1: akozik at atgc.org
> > email#2: akozik at gmail.com
> > web: http://www.atgc.org/
> >
> > Chris Fields wrote:
> >> Yes, you can only retrieve ~500 sequences at a time with either
> >> module.  Both Bio::DB::GenBank and Bio::DB::EUtilities interact
> >> with NCBI's EUtilities (the former returns Bio::Seq/Bio::SeqIO
> >> objects, the latter returns the raw data from the URL to be
> >> processed later).
> >> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
> >> You can usually post more IDs using epost and then fetch sequences
> >> by referring to the WebEnv/query_key combo (batch posting).  I try
> >> to make this a bit easier with EUtilities but it is woefully
> >> lacking in documentation (my fault); there is some code up on the
> >> wiki which should work.
> >> chris
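[Inline note: I never did get this working, but for the archive, here is roughly what the epost/WebEnv route Chris describes looks like with Bio::DB::EUtilities. Treat it as a guess: the method names (next_History, set_parameters, get_Response) come from newer Bio::DB::EUtilities documentation and may not match the version from this thread.]

#########
use strict;
use warnings;
use Bio::DB::EUtilities;

# Read accessions from a file given on the command line, as in my script.
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
chomp(my @ids = grep { /\S/ } <$in>);
close $in;

# 1. epost the whole list; NCBI stores it server-side and returns a
#    WebEnv/query_key pair ("history") to refer back to.
my $factory = Bio::DB::EUtilities->new(
	-eutil => 'epost',
	-db    => 'nucleotide',
	-id    => \@ids,
	-email => 'me@example.org',  # newer versions ask for a contact address
);
my $hist = $factory->next_History or die "epost returned no history";

# 2. efetch against the stored history in 500-record slices.
$factory->set_parameters(
	-eutil   => 'efetch',
	-rettype => 'gb',
	-history => $hist,
);
open my $out, '>', $ARGV[1] or die "Cannot write $ARGV[1]: $!";
for (my $retstart = 0; $retstart < @ids; $retstart += 500) {
	$factory->set_parameters(-retstart => $retstart, -retmax => 500);
	eval {
		$factory->get_Response(-cb => sub { my ($data) = @_; print $out $data });
	};
	warn "Slice at $retstart failed: $@" if $@;
}
close $out;
#########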
> >>
> >> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
> >>> Hello,
> >>>
> >>> I would like to download a large number of sequences from GenBank
> >>> (122,146 to be exact) given a list of accession numbers.
> >>> I first looked into Bio::DB::EUtilities, but got lost and
> >>> finally used Bio::DB::GenBank.
> >>> My script works well for short requests, but it gives the
> >>> following error on the long one:
> >>>
> >>> ------------- EXCEPTION: Bio::Root::Exception -------------
> >>> MSG: WebDBSeqI Request Error:
> >>> 500 short write
> >>> Content-Type: text/plain
> >>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
> >>> Client-Warning: Internal response
> >>>
> >>> 500 short write
> >>>
> >>> STACK: Error::throw
> >>> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/
> >>> Root.pm:359
> >>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/
> >>> Bio/DB/WebDBSeqI.pm:685
> >>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/
> >>> 5.8.8/Bio/DB/WebDBSeqI.pm:472
> >>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/
> >>> perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> >>> STACK: ./fetch_from_genbank.pl:58
> >>> ---------------------------------------------------------
> >>>
> >>> Does that mean that we can only fetch 500 sequences at a time?
> >>> Should I split my list into 500-id fragments and submit them one
> >>> after the other?
> >>>
> >>> Any suggestions are very welcome...
> >>> Thanks,
> >>> -Tristan
> >>>
> >>>
> >>> Here is the script:
> >>>
> >>> ##################################
> >>> use strict;
> >>> use warnings;
> >>> use Bio::DB::GenBank;
> >>> # use Bio::DB::EUtilities;
> >>> use Bio::SeqIO;
> >>> use Getopt::Long;
> >>>
> >>> # 2008-01-22 T Lefebure
> >>> # I tried to use Bio::DB::EUtilities without much success and went
> >>> # back to Bio::DB::GenBank.
> >>> # The following procedure is not ideal, as the stream is first
> >>> # copied to a temporary file and then re-used by BioPerl to
> >>> # generate the final file.
> >>>
> >>> my $db = 'nucleotide';
> >>> my $format = 'genbank';
> >>> my $help= '';
> >>> my $dformat = 'gb';
> >>>
> >>> GetOptions(
> >>>    'help|?' => \$help,
> >>>    'format=s'  => \$format,
> >>>    'database=s'    => \$db,
> >>> );
> >>>
> >>>
> >>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
> >>>
> >>> Will download the corresponding data from GenBank. BioPerl is
> >>> required.
> >>>
> >>> Options:
> >>>    -h
> >>>        print this help
> >>>    -format: genbank|fasta|...
> >>>        give output format (default=genbank)
> >>>    -database: nucleotide|genome|protein|...
> >>>        define the database to search in (default=nucleotide)
> >>>
> >>> The full description of the options can be found at
> >>> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
> >>> \n";
> >>>
> >>> if ($#ARGV<1) {
> >>>    print $printhelp;
> >>>    exit;
> >>> }
> >>>
> >>> open LIST, $ARGV[0];
> >>> my @list = <LIST>;
> >>>
> >>> if ($format eq 'fasta') { $dformat = 'fasta' }
> >>>
> >>> my $gb = new Bio::DB::GenBank(    -retrievaltype => 'tempfile',
> >>>                -format => $dformat,
> >>>                -db => $db,
> >>>            );
> >>> my $seqio = $gb->get_Stream_by_acc(\@list);
> >>>
> >>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
> >>>                -format => $format,
> >>>            );
> >>> while (my $seqo = $seqio->next_seq ) {
> >>>    print $seqo->id, "\n";
> >>>    $seqout->write_seq($seqo);
> >>> }
> >>
> >> Christopher Fields
> >> Postdoctoral Researcher
> >> Lab of Dr. Robert Switzer
> >> Dept of Biochemistry
> >> University of Illinois Urbana-Champaign
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
