[Bioperl-l] Reliability of get_Seq_by_gi
Wes Barris
wes.barris at csiro.au
Sun Apr 4 18:22:11 EDT 2004
Heikki Lehvaslaiho wrote:
> Wes,
>
> See:
> http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
>
> Yours,
> -Heikki
Yes, but as I pointed out in my message, there are some problems in
doing that.
>
> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
>
>>Hi,
>>
>>I have been struggling for a long time with a bioperl script that
>>retrieves a series of sequences in genbank format. The problem
>>is that there is the occasional error in downloading a sequence
>>which results in a fatal error in my script. I need to be able
>>to gracefully recover from a downloading error such that my
>>script continues to run.
>>
>>Here is my script:
>>
>>#!/usr/local/bin/perl -w
>>#
>>#
>>use Bio::DB::GenBank;
>>my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>my $accOrGi = shift or die $usage;
>>my $outfile = shift or die $usage;
>>my $wtime = 73;
>>#
>># Prepare input.
>>#
>>my @accOrGiList = (); # empty the list
>>if (-r $accOrGi) {
>> open(IN, $accOrGi);
>> while (<IN>) {
>> chomp;
>> my @junk = split(/\s+/);
>> push(@accOrGiList, $junk[0]);
>> }
>> close(IN);
>> }
>>else {
>> push(@accOrGiList, $accOrGi);
>> }
>>#
>># Prepare output.
>>#
>>my $seq_out;
>>if ($outfile =~ /\.gb$/) {
>> $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
>> }
>>elsif ($outfile =~ /\.fa$/) {
>> $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>> }
>>else {
>> die "Unrecognized output file extension. Use either .gb or .fa\n";
>> }
>>#
>># Get the entries.
>>#
>>my $gb = Bio::DB::GenBank->new();
>>my $seq;
>>my $got = 0;
>>foreach $accOrGi (@accOrGiList) {
>> if ($accOrGi =~ /^\d+/) {
>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>> $seq = $gb->get_Seq_by_gi($accOrGi);
>> }
>> else {
>> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>> $seq = $gb->get_Seq_by_acc($accOrGi);
>> }
>> $seq_out->write_seq($seq);
>> $got++;
>> sleep $wtime unless ($wtime == 0);
>> }
>>
>>Using the above script, at least once a day, I get this error:
>>
>>Getting gi 45498451 (747/41132)...
>>Getting gi 45498450 (748/41132)...
>>Getting gi 45498449 (749/41132)...
>>
>>------------- EXCEPTION -------------
>>MSG: WebDBSeqI Request Error:
>>500 (Internal Server Error) read timeout
>>Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>
>>
>>
>>STACK Bio::DB::WebDBSeqI::_stream_request
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>STACK Bio::DB::WebDBSeqI::get_seq_stream
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>
>>--------------------------------------
>>
>>-------------------- WARNING ---------------------
>>MSG: gi (45498449) does not exist
>>---------------------------------------------------
>>
>>------------- EXCEPTION -------------
>>MSG: Attempting to write with no seq!
>>STACK Bio::SeqIO::genbank::write_seq
>>/usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>
>>--------------------------------------
>>
>>and then the script dies. I have tried wrapping the "get" part
>>inside an eval like this:
>>
>>foreach $accOrGi (@accOrGiList) {
>> eval {
>> if ($accOrGi =~ /^\d+/) {
>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>> $seq = $gb->get_Seq_by_gi($accOrGi);
>> }
>> else {
>> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>> $seq = $gb->get_Seq_by_acc($accOrGi);
>> }
>> };
>> if ($@) {
>> print("$accOrGi failed.\n");
>> }
>> .
>> .
>> .
>>
>>This prevents the script from dieing, but it causes several other problems:
>> - it spawns a second instance of my script for each error
>> - STDOUT stops being echoed to the terminal (I no longer see any
>>printed output) - it starts getting duplicate sequences
>>
>>Is there a reliable way to retrieve sequeces from NCBI?
>>
>>I'm using bioperl-1.4.
>
>
--
Wes Barris
E-Mail: Wes.Barris at csiro.au
More information about the Bioperl-l
mailing list