[Bioperl-l] Reliability of get_Seq_by_gi
Rob Edwards
redwards at utmem.edu
Sun Apr 4 18:45:56 EDT 2004
Just a suggestion, but have you tried making sure that $seq exists
before writing. Something like this should fix the problem:
}
if ($seq) {$seq_out->write_seq($seq)}
else {print STDERR "Couldn't retrieve a sequence for $accOrGi\n"}
$got++;
sleep $wtime unless ($wtime == 0);
}
I don't think $seq exists if the server times out.
You should capture all the failed sequences and then validate them in
case there is some error with the Acc/GI (like they doesn't exist any
more).
Rob
On Apr 4, 2004, at 5:22 PM, Wes Barris wrote:
> Heikki Lehvaslaiho wrote:
>> Wes,
>> See:
>> http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
>> Yours,
>> -Heikki
>
> Yes, but as I pointed out in my message, there are some problems in
> doing that.
>
>> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
>>> Hi,
>>>
>>> I have been struggling for a long time with a bioperl script that
>>> retrieves a series of sequences in genbank format. The problem
>>> is that there is the occasional error in downloading a sequence
>>> which results in a fatal error in my script. I need to be able
>>> to gracefully recover from a downloading error such that my
>>> script continues to run.
>>>
>>> Here is my script:
>>>
>>> #!/usr/local/bin/perl -w
>>> #
>>> #
>>> use Bio::DB::GenBank;
>>> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>> my $accOrGi = shift or die $usage;
>>> my $outfile = shift or die $usage;
>>> my $wtime = 73;
>>> #
>>> # Prepare input.
>>> #
>>> my @accOrGiList = (); # empty the list
>>> if (-r $accOrGi) {
>>> open(IN, $accOrGi);
>>> while (<IN>) {
>>> chomp;
>>> my @junk = split(/\s+/);
>>> push(@accOrGiList, $junk[0]);
>>> }
>>> close(IN);
>>> }
>>> else {
>>> push(@accOrGiList, $accOrGi);
>>> }
>>> #
>>> # Prepare output.
>>> #
>>> my $seq_out;
>>> if ($outfile =~ /\.gb$/) {
>>> $seq_out = Bio::SeqIO->new( -format=>'genbank',
>>> -file=>">$outfile");
>>> }
>>> elsif ($outfile =~ /\.fa$/) {
>>> $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>>> }
>>> else {
>>> die "Unrecognized output file extension. Use either .gb or .fa\n";
>>> }
>>> #
>>> # Get the entries.
>>> #
>>> my $gb = Bio::DB::GenBank->new();
>>> my $seq;
>>> my $got = 0;
>>> foreach $accOrGi (@accOrGiList) {
>>> if ($accOrGi =~ /^\d+/) {
>>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>> $seq = $gb->get_Seq_by_gi($accOrGi);
>>> }
>>> else {
>>> print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>> $seq = $gb->get_Seq_by_acc($accOrGi);
>>> }
>>> $seq_out->write_seq($seq);
>>> $got++;
>>> sleep $wtime unless ($wtime == 0);
>>> }
>>>
>>> Using the above script, at least once a day, I get this error:
>>>
>>> Getting gi 45498451 (747/41132)...
>>> Getting gi 45498450 (748/41132)...
>>> Getting gi 45498449 (749/41132)...
>>>
>>> ------------- EXCEPTION -------------
>>> MSG: WebDBSeqI Request Error:
>>> 500 (Internal Server Error) read timeout
>>> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>>
>>>
>>>
>>> STACK Bio::DB::WebDBSeqI::_stream_request
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>> STACK Bio::DB::WebDBSeqI::get_seq_stream
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>>
>>> --------------------------------------
>>>
>>> -------------------- WARNING ---------------------
>>> MSG: gi (45498449) does not exist
>>> ---------------------------------------------------
>>>
>>> ------------- EXCEPTION -------------
>>> MSG: Attempting to write with no seq!
>>> STACK Bio::SeqIO::genbank::write_seq
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>>
>>> --------------------------------------
>>>
>>> and then the script dies. I have tried wrapping the "get" part
>>> inside an eval like this:
>>>
>>> foreach $accOrGi (@accOrGiList) {
>>> eval {
>>> if ($accOrGi =~ /^\d+/) {
>>> print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>> $seq = $gb->get_Seq_by_gi($accOrGi);
>>> }
>>> else {
>>> print("Getting accession $accOrGi
>>> ($got/$#accOrGiList)...\n");
>>> $seq = $gb->get_Seq_by_acc($accOrGi);
>>> }
>>> };
>>> if ($@) {
>>> print("$accOrGi failed.\n");
>>> }
>>> .
>>> .
>>> .
>>>
>>> This prevents the script from dieing, but it causes several other
>>> problems:
>>> - it spawns a second instance of my script for each error
>>> - STDOUT stops being echoed to the terminal (I no longer see any
>>> printed output) - it starts getting duplicate sequences
>>>
>>> Is there a reliable way to retrieve sequeces from NCBI?
>>>
>>> I'm using bioperl-1.4.
>
>
> --
> Wes Barris
> E-Mail: Wes.Barris at csiro.au
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list