[Bioperl-l] Reliability of get_Seq_by_gi

Fri Apr 2 03:43:36 EST 2004

Wes,

See:
	http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html

Yours,
	-Heikki

On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
> Hi,
>
> I have been struggling for a long time with a bioperl script that
> retrieves a series of sequences in genbank format.  The problem
> is that there is the occasional error in downloading a sequence
> which results in a fatal error in my script.  I need to be able
> to gracefully recover from a downloading error such that my
> script continues to run.
>
> Here is my script:
>
> #!/usr/local/bin/perl -w
> #
> #
> use Bio::DB::GenBank;
> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
> my $accOrGi = shift or die $usage;
> my $outfile = shift or die $usage;
> my $wtime = 73;
> #
> # Prepare input.
> #
> my @accOrGiList = ();   # empty the list
> if (-r $accOrGi) {
>     open(IN, $accOrGi);
>     while (<IN>) {
>        chomp;
>        my @junk = split(/\s+/);
>        push(@accOrGiList, $junk[0]);
>        }
>     close(IN);
>     }
> else {
>     push(@accOrGiList, $accOrGi);
>     }
> #
> # Prepare output.
> #
> my $seq_out;
> if ($outfile =~ /\.gb$/) {
>     $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
>     }
> elsif ($outfile =~ /\.fa$/) {
>     $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>     }
> else {
>     die "Unrecognized output file extension. Use either .gb or .fa\n";
>     }
> #
> # Get the entries.
> #
> my $gb = Bio::DB::GenBank->new();
> my $seq;
> my $got = 0;
> foreach $accOrGi (@accOrGiList) {
>     if ($accOrGi =~ /^\d+/) {
>        print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>        $seq = $gb->get_Seq_by_gi($accOrGi);
>        }
>     else {
>        print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>        $seq = $gb->get_Seq_by_acc($accOrGi);
>        }
>     $seq_out->write_seq($seq);
>     $got++;
>     sleep $wtime unless ($wtime == 0);
>     }
>
> Using the above script, at least once a day, I get this error:
>
> Getting gi 45498451 (747/41132)...
> Getting gi 45498450 (748/41132)...
> Getting gi 45498449 (749/41132)...
>
> ------------- EXCEPTION  -------------
> MSG: WebDBSeqI Request Error:
> 500 (Internal Server Error) read timeout
> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>
>
>
> STACK Bio::DB::WebDBSeqI::_stream_request
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
> STACK Bio::DB::WebDBSeqI::get_seq_stream
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>
> --------------------------------------
>
> -------------------- WARNING ---------------------
> MSG: gi (45498449) does not exist
> ---------------------------------------------------
>
> ------------- EXCEPTION  -------------
> MSG: Attempting to write with no seq!
> STACK Bio::SeqIO::genbank::write_seq
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>
> --------------------------------------
>
> and then the script dies.  I have tried wrapping the "get" part
> inside an eval like this:
>
> foreach $accOrGi (@accOrGiList) {
>     eval {
>        if ($accOrGi =~ /^\d+/) {
>           print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>           $seq = $gb->get_Seq_by_gi($accOrGi);
>           }
>        else {
>           print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>           $seq = $gb->get_Seq_by_acc($accOrGi);
>           }
>        };
>     if ($@) {
>        print("$accOrGi failed.\n");
>        }
>     .
>     .
>     .
>
> This prevents the script from dieing, but it causes several other problems:
>     - it spawns a second instance of my script for each error
>     - STDOUT stops being echoed to the terminal (I no longer see any
> printed output) - it starts getting duplicate sequences
>
> Is there a reliable way to retrieve sequeces from NCBI?
>
> I'm using bioperl-1.4.

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki_at_ebi ac uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________