[Bioperl-l] Reliability of get_Seq_by_gi
Wes Barris
wes.barris at csiro.au
Mon Apr 5 18:44:12 EDT 2004
Rob Edwards wrote:
> Just a suggestion, but have you tried making sure that $seq exists
> before writing. Something like this should fix the problem:
>
> }
> if ($seq) {$seq_out->write_seq($seq)}
> else {print STDERR "Couldn't retrieve a sequence for $accOrGi\n"}
> $got++;
> sleep $wtime unless ($wtime == 0);
> }
>
> I don't think $seq exists if the server times out.
>
> You should capture all the failed sequences and then validate them in
> case there is some error with the Acc/GI (like they doesn't exist any
> more).
>
> Rob
Hi Rob,
Your solution appears to have worked. I was able to download sequences
over night and my script was still running this morning -- the first time
it has ever done that. Thanks.
>
>
>
> On Apr 4, 2004, at 5:22 PM, Wes Barris wrote:
>
>> Heikki Lehvaslaiho wrote:
>>
>>> Wes,
>>> See:
>>> http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
>>> Yours,
>>> -Heikki
>>
>>
>> Yes, but as I pointed out in my message, there are some problems in
>> doing that.
>>
>>> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been struggling for a long time with a bioperl script that
>>>> retrieves a series of sequences in genbank format. The problem
>>>> is that there is the occasional error in downloading a sequence
>>>> which results in a fatal error in my script. I need to be able
>>>> to gracefully recover from a downloading error such that my
>>>> script continues to run.
>>>>
>>>> Here is my script:
>>>>
>>>> #!/usr/local/bin/perl -w
>>>> #
>>>> #
>>>> use Bio::DB::GenBank;
>>>> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>>> my $accOrGi = shift or die $usage;
>>>> my $outfile = shift or die $usage;
>>>> my $wtime = 73;
>>>> #
>>>> # Prepare input.
>>>> #
>>>> my @accOrGiList = (); # empty the list
>>>> if (-r $accOrGi) {
>>>> open(IN, $accOrGi);
>>>> while (<IN>) {
>>>> chomp;
>>>> my @junk = split(/\s+/);
>>>> push(@accOrGiList, $junk[0]);
>>>> }
>>>> close(IN);
>>>> }
>>>> else {
>>>> push(@accOrGiList, $accOrGi);
>>>> }
>>>> #
>>>> # Prepare output.
>>>> #
>>>> my $seq_out;
>>>> if ($outfile =~ /\.gb$/) {
>>>> $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
>>>> }
>>>> elsif ($outfile =~ /\.fa$/) {
>>>> $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>>>> }
>>>> else {
>>>> die "Unrecognized output file extension. Use either .gb or .fa\n";
>>>> }
>>>> #
>>>> # Get the entries.
>>>> #
>>>> my $gb = Bio::DB::GenBank->new();
>>>> my $seq;
>>>> my $got = 0;
>>>> foreach $accOrGi (@accOrGiList) {
>>>> if ($accOrGi =~ /^\d+/) {
>>>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>> $seq = $gb->get_Seq_by_gi($accOrGi);
>>>> }
>>>> else {
>>>> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>>> $seq = $gb->get_Seq_by_acc($accOrGi);
>>>> }
>>>> $seq_out->write_seq($seq);
>>>> $got++;
>>>> sleep $wtime unless ($wtime == 0);
>>>> }
>>>>
>>>> Using the above script, at least once a day, I get this error:
>>>>
>>>> Getting gi 45498451 (747/41132)...
>>>> Getting gi 45498450 (748/41132)...
>>>> Getting gi 45498449 (749/41132)...
>>>>
>>>> ------------- EXCEPTION -------------
>>>> MSG: WebDBSeqI Request Error:
>>>> 500 (Internal Server Error) read timeout
>>>> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>>>
>>>>
>>>>
>>>> STACK Bio::DB::WebDBSeqI::_stream_request
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>>> STACK Bio::DB::WebDBSeqI::get_seq_stream
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>>> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>>> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>>>
>>>> --------------------------------------
>>>>
>>>> -------------------- WARNING ---------------------
>>>> MSG: gi (45498449) does not exist
>>>> ---------------------------------------------------
>>>>
>>>> ------------- EXCEPTION -------------
>>>> MSG: Attempting to write with no seq!
>>>> STACK Bio::SeqIO::genbank::write_seq
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>>>
>>>> --------------------------------------
>>>>
>>>> and then the script dies. I have tried wrapping the "get" part
>>>> inside an eval like this:
>>>>
>>>> foreach $accOrGi (@accOrGiList) {
>>>> eval {
>>>> if ($accOrGi =~ /^\d+/) {
>>>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>> $seq = $gb->get_Seq_by_gi($accOrGi);
>>>> }
>>>> else {
>>>> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>>> $seq = $gb->get_Seq_by_acc($accOrGi);
>>>> }
>>>> };
>>>> if ($@) {
>>>> print("$accOrGi failed.\n");
>>>> }
>>>> .
>>>> .
>>>> .
>>>>
>>>> This prevents the script from dieing, but it causes several other
>>>> problems:
>>>> - it spawns a second instance of my script for each error
>>>> - STDOUT stops being echoed to the terminal (I no longer see any
>>>> printed output) - it starts getting duplicate sequences
>>>>
>>>> Is there a reliable way to retrieve sequeces from NCBI?
>>>>
>>>> I'm using bioperl-1.4.
>>
>>
>>
>> --
>> Wes Barris
>> E-Mail: Wes.Barris at csiro.au
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
--
Wes Barris
E-Mail: Wes.Barris at csiro.au
More information about the Bioperl-l
mailing list