[Bioperl-l] Reliability of get_Seq_by_gi
Wes Barris
wes.barris at csiro.au
Thu Apr 1 17:02:36 EST 2004
Hi,
I have been struggling for a long time with a bioperl script that
retrieves a series of sequences in genbank format. The problem
is that there is the occasional error in downloading a sequence
which results in a fatal error in my script. I need to be able
to gracefully recover from a downloading error such that my
script continues to run.
Here is my script:
#!/usr/local/bin/perl -w
#
#
use Bio::DB::GenBank;
my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
my $accOrGi = shift or die $usage;
my $outfile = shift or die $usage;
my $wtime = 73;
#
# Prepare input.
#
my @accOrGiList = (); # empty the list
if (-r $accOrGi) {
open(IN, $accOrGi);
while (<IN>) {
chomp;
my @junk = split(/\s+/);
push(@accOrGiList, $junk[0]);
}
close(IN);
}
else {
push(@accOrGiList, $accOrGi);
}
#
# Prepare output.
#
my $seq_out;
if ($outfile =~ /\.gb$/) {
$seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
}
elsif ($outfile =~ /\.fa$/) {
$seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
}
else {
die "Unrecognized output file extension. Use either .gb or .fa\n";
}
#
# Get the entries.
#
my $gb = Bio::DB::GenBank->new();
my $seq;
my $got = 0;
foreach $accOrGi (@accOrGiList) {
if ($accOrGi =~ /^\d+/) {
print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
$seq = $gb->get_Seq_by_gi($accOrGi);
}
else {
print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
$seq = $gb->get_Seq_by_acc($accOrGi);
}
$seq_out->write_seq($seq);
$got++;
sleep $wtime unless ($wtime == 0);
}
Using the above script, at least once a day, I get this error:
Getting gi 45498451 (747/41132)...
Getting gi 45498450 (748/41132)...
Getting gi 45498449 (749/41132)...
------------- EXCEPTION -------------
MSG: WebDBSeqI Request Error:
500 (Internal Server Error) read timeout
Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
STACK Bio::DB::WebDBSeqI::_stream_request
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
STACK Bio::DB::WebDBSeqI::get_seq_stream
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
--------------------------------------
-------------------- WARNING ---------------------
MSG: gi (45498449) does not exist
---------------------------------------------------
------------- EXCEPTION -------------
MSG: Attempting to write with no seq!
STACK Bio::SeqIO::genbank::write_seq
/usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
--------------------------------------
and then the script dies. I have tried wrapping the "get" part
inside an eval like this:
foreach $accOrGi (@accOrGiList) {
eval {
if ($accOrGi =~ /^\d+/) {
print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
$seq = $gb->get_Seq_by_gi($accOrGi);
}
else {
print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
$seq = $gb->get_Seq_by_acc($accOrGi);
}
};
if ($@) {
print("$accOrGi failed.\n");
}
.
.
.
This prevents the script from dieing, but it causes several other problems:
- it spawns a second instance of my script for each error
- STDOUT stops being echoed to the terminal (I no longer see any printed output)
- it starts getting duplicate sequences
Is there a reliable way to retrieve sequeces from NCBI?
I'm using bioperl-1.4.
--
Wes Barris
E-Mail: Wes.Barris at csiro.au
More information about the Bioperl-l
mailing list