[Bioperl-l] Reliability of get_Seq_by_gi

Sun Apr 4 18:22:11 EDT 2004

Heikki Lehvaslaiho wrote:
> Wes,
> 
> See:
> 	http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
> 
> Yours,
> 	-Heikki

Yes, but as I pointed out in my message, there are some problems in
doing that.

> 
> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
> 
>>Hi,
>>
>>I have been struggling for a long time with a bioperl script that
>>retrieves a series of sequences in genbank format.  The problem
>>is that there is the occasional error in downloading a sequence
>>which results in a fatal error in my script.  I need to be able
>>to gracefully recover from a downloading error such that my
>>script continues to run.
>>
>>Here is my script:
>>
>>#!/usr/local/bin/perl -w
>>#
>>#
>>use Bio::DB::GenBank;
>>my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>my $accOrGi = shift or die $usage;
>>my $outfile = shift or die $usage;
>>my $wtime = 73;
>>#
>># Prepare input.
>>#
>>my @accOrGiList = ();   # empty the list
>>if (-r $accOrGi) {
>>    open(IN, $accOrGi);
>>    while (<IN>) {
>>       chomp;
>>       my @junk = split(/\s+/);
>>       push(@accOrGiList, $junk[0]);
>>       }
>>    close(IN);
>>    }
>>else {
>>    push(@accOrGiList, $accOrGi);
>>    }
>>#
>># Prepare output.
>>#
>>my $seq_out;
>>if ($outfile =~ /\.gb$/) {
>>    $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
>>    }
>>elsif ($outfile =~ /\.fa$/) {
>>    $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>>    }
>>else {
>>    die "Unrecognized output file extension. Use either .gb or .fa\n";
>>    }
>>#
>># Get the entries.
>>#
>>my $gb = Bio::DB::GenBank->new();
>>my $seq;
>>my $got = 0;
>>foreach $accOrGi (@accOrGiList) {
>>    if ($accOrGi =~ /^\d+/) {
>>       print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>       $seq = $gb->get_Seq_by_gi($accOrGi);
>>       }
>>    else {
>>       print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>       $seq = $gb->get_Seq_by_acc($accOrGi);
>>       }
>>    $seq_out->write_seq($seq);
>>    $got++;
>>    sleep $wtime unless ($wtime == 0);
>>    }
>>
>>Using the above script, at least once a day, I get this error:
>>
>>Getting gi 45498451 (747/41132)...
>>Getting gi 45498450 (748/41132)...
>>Getting gi 45498449 (749/41132)...
>>
>>------------- EXCEPTION  -------------
>>MSG: WebDBSeqI Request Error:
>>500 (Internal Server Error) read timeout
>>Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>
>>
>>
>>STACK Bio::DB::WebDBSeqI::_stream_request
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>STACK Bio::DB::WebDBSeqI::get_seq_stream
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>
>>--------------------------------------
>>
>>-------------------- WARNING ---------------------
>>MSG: gi (45498449) does not exist
>>---------------------------------------------------
>>
>>------------- EXCEPTION  -------------
>>MSG: Attempting to write with no seq!
>>STACK Bio::SeqIO::genbank::write_seq
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>
>>--------------------------------------
>>
>>and then the script dies.  I have tried wrapping the "get" part
>>inside an eval like this:
>>
>>foreach $accOrGi (@accOrGiList) {
>>    eval {
>>       if ($accOrGi =~ /^\d+/) {
>>          print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>          $seq = $gb->get_Seq_by_gi($accOrGi);
>>          }
>>       else {
>>          print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>          $seq = $gb->get_Seq_by_acc($accOrGi);
>>          }
>>       };
>>    if ($@) {
>>       print("$accOrGi failed.\n");
>>       }
>>    .
>>    .
>>    .
>>
>>This prevents the script from dieing, but it causes several other problems:
>>    - it spawns a second instance of my script for each error
>>    - STDOUT stops being echoed to the terminal (I no longer see any
>>printed output) - it starts getting duplicate sequences
>>
>>Is there a reliable way to retrieve sequeces from NCBI?
>>
>>I'm using bioperl-1.4.
> 
> 

-- 
Wes Barris
E-Mail: Wes.Barris at csiro.au