[Bioperl-l] Reliability of get_Seq_by_gi
Heikki Lehvaslaiho
heikki at ebi.ac.uk
Fri Apr 2 03:43:36 EST 2004
Wes,
See:
http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
Yours,
-Heikki
On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
> Hi,
>
> I have been struggling for a long time with a bioperl script that
> retrieves a series of sequences in genbank format. The problem
> is that there is the occasional error in downloading a sequence
> which results in a fatal error in my script. I need to be able
> to gracefully recover from a downloading error such that my
> script continues to run.
>
> Here is my script:
>
> #!/usr/local/bin/perl -w
> #
> #
> use Bio::DB::GenBank;
> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
> my $accOrGi = shift or die $usage;
> my $outfile = shift or die $usage;
> my $wtime = 73;
> #
> # Prepare input.
> #
> my @accOrGiList = (); # empty the list
> if (-r $accOrGi) {
> open(IN, $accOrGi);
> while (<IN>) {
> chomp;
> my @junk = split(/\s+/);
> push(@accOrGiList, $junk[0]);
> }
> close(IN);
> }
> else {
> push(@accOrGiList, $accOrGi);
> }
> #
> # Prepare output.
> #
> my $seq_out;
> if ($outfile =~ /\.gb$/) {
> $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
> }
> elsif ($outfile =~ /\.fa$/) {
> $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
> }
> else {
> die "Unrecognized output file extension. Use either .gb or .fa\n";
> }
> #
> # Get the entries.
> #
> my $gb = Bio::DB::GenBank->new();
> my $seq;
> my $got = 0;
> foreach $accOrGi (@accOrGiList) {
> if ($accOrGi =~ /^\d+/) {
> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
> $seq = $gb->get_Seq_by_gi($accOrGi);
> }
> else {
> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
> $seq = $gb->get_Seq_by_acc($accOrGi);
> }
> $seq_out->write_seq($seq);
> $got++;
> sleep $wtime unless ($wtime == 0);
> }
>
> Using the above script, at least once a day, I get this error:
>
> Getting gi 45498451 (747/41132)...
> Getting gi 45498450 (748/41132)...
> Getting gi 45498449 (749/41132)...
>
> ------------- EXCEPTION -------------
> MSG: WebDBSeqI Request Error:
> 500 (Internal Server Error) read timeout
> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>
>
>
> STACK Bio::DB::WebDBSeqI::_stream_request
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
> STACK Bio::DB::WebDBSeqI::get_seq_stream
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>
> --------------------------------------
>
> -------------------- WARNING ---------------------
> MSG: gi (45498449) does not exist
> ---------------------------------------------------
>
> ------------- EXCEPTION -------------
> MSG: Attempting to write with no seq!
> STACK Bio::SeqIO::genbank::write_seq
> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>
> --------------------------------------
>
> and then the script dies. I have tried wrapping the "get" part
> inside an eval like this:
>
> foreach $accOrGi (@accOrGiList) {
> eval {
> if ($accOrGi =~ /^\d+/) {
> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
> $seq = $gb->get_Seq_by_gi($accOrGi);
> }
> else {
> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
> $seq = $gb->get_Seq_by_acc($accOrGi);
> }
> };
> if ($@) {
> print("$accOrGi failed.\n");
> }
> .
> .
> .
>
> This prevents the script from dieing, but it causes several other problems:
> - it spawns a second instance of my script for each error
> - STDOUT stops being echoed to the terminal (I no longer see any
> printed output) - it starts getting duplicate sequences
>
> Is there a reliable way to retrieve sequeces from NCBI?
>
> I'm using bioperl-1.4.
--
______ _/ _/_____________________________________________________
_/ _/ http://www.ebi.ac.uk/mutations/
_/ _/ _/ Heikki Lehvaslaiho heikki_at_ebi ac uk
_/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
_/ _/ _/ Wellcome Trust Genome Campus, Hinxton
_/ _/ _/ Cambs. CB10 1SD, United Kingdom
_/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________
More information about the Bioperl-l
mailing list