[Bioperl-l] Remote BLAST returning lots of 500 time out errors
Barry Moore
barry.moore at genetics.utah.edu
Sat Feb 7 23:24:48 EST 2004
I have a script that uses Bio::Tools::Run::RemoteBlast to BLAST the
translations of all ORFs from a mRNA transcript against the database. It
works fine, except that if I run several sequences at once, after about
50 ORFs worth of BLASTing, NCBI starts to return errors (500 read
time-out) for every job submitted. I can't figure out what's going on
here. Any ideas?
The script is kind of long and it take several minutes to get to the
errors, but if anyone wants to try to recreate the error I've attached
the code below. Some of you will probably recognize bits of your code
that I've pilfered from various Bioperl docs. I'm running Bioperl 1.4,
ActiveState perl 5.8.0.805 on Windows XP. I get the error by running:
perl ORF_BLAST1.pl –min_length 150 NM_001112 NM_007327 NM_015833 NM_021569
Barry Moore
Dept. Human Genetics
University of Utah
------------------------------------------------------------------------------------------------------
#!/usr/bin/perl
#ORF_BLAST1.pl
#See end of file for POD documentation
use strict;
use warnings;
use GD;
use Getopt::Long;
use Bio::SeqIO;
use Bio::PrimarySeq;
use Bio::DB::GenBank;
use Bio::Tools::Run::RemoteBlast;
#Give documentation when requested, or when missing command line arguments.
if ( ! $ARGV[0] || $ARGV[0] =~ /^-{1,2}(h|help|\?)$/i ) {
system ( "perldoc", $0 ) and die "For usage, use perldoc $0\n";
exit( 0 );
}
#Declare module level variables.
my $in_filename; #This variable holds the filename of the input file.
my $out_filename; #Suprisingly this variable holds the filname of the
output file.
my $format; #This variable defines the output format (png or jpg).
my $min_length; #This variable defines the minimum ORF length to plot.
my $require_start; #This boolean varible identifies of an ORF must begin
with a start.
my $seqio; #This variable holds the SeqIO object.
#Handle command line options.
GetOptions (
'in_file:s' => \$in_filename,
'out_file:s' => \$out_filename,
'format:s' => \$format,
'min_length:i' => \$min_length,
'start!' => \$require_start
);
my @accession = @ARGV;
#Set defaults.
$format ||= 'jpg';
$min_length ||= 150;
$require_start ||= 0;
#Define new SeqIO object. Take input from a file first if a
#filename has been specified. Otherwise take input from accession
numbers off
#the command line (but don't try to do both)
if ($in_filename) {
$seqio = Bio::SeqIO->new(-format=>'fasta', -file=>$in_filename) or die
"could not create Bio::SeqIO";
}
elsif (@accession) {
my $gb = new Bio::DB::GenBank();
$seqio = $gb->get_Stream_by_id(\@accession);
}
#Remote-BLAST factory object creation and blast-parameter initialization
my $BLAST_factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
'-data' => 'nr',
'-expect' => '10',
'-readmethod' => 'SearchIO' );
#Main program loop to loop over all sequences input.
while (my $seq_obj = $seqio->next_seq) {
#Assign sequence specfic variables
my @starts = ();
my @stops = ();
my @orfs = ();
my $out_filename;
my $sequence = $seq_obj->seq;
my $sequence_length = length($sequence);
my $header = $seq_obj->display_name."|".$seq_obj->desc;
#Internal loop to find starts, stops, and ORFs
for my $count1 (0 .. $sequence_length) {
my $open = 0;
if ($count1 < 4) {$open = 1}
my $count2 = $count1;
my $codon = substr($sequence, $count1, 3);
my $frame = $count1 % 3; #Get the modulus of $count1/3 for ascertaining
the frame
#Convert the modulus above into frame 1, 2, or 3.
if ($frame == 1) {$frame = 2}
elsif ($frame == 2) {$frame = 3}
elsif ($frame == 0) {$frame = 1}
#Add starts to stack.
if ($codon =~ /ATG/i) {
push @starts, {start => $count1,
frame => $frame};
if ($require_start == 1) {$open = 1} #Open the ORF flag.
}
#Add stops to stack.
if ($codon =~ /TGA|TAG|TAA/i) {
push @stops, {stop => $count1,
frame => $frame};
if ($require_start == 0) {$open = 1} #Open the ORF flag.
}
#Find extend of ORF if one has been opened by either of the above
#conditionals.
if ($open == 1) {
$codon = "";
my $count2 = $count1;
#Loop to step forward through ORF looking for next in frame stop.
while (($codon !~ /TGA|TAG|TAA/i) and ($count2 < $sequence_length - 4)) {
$count2 = $count2 + 3; #Keep it in frame.
$codon = substr($sequence, $count2, 3);
}
#Make sure the ORF is long enough to count...
if ($count2 - $count1 >= $min_length) {
push @orfs, {begin => $count1,
end => $count2,
frame => $frame
}; #...then push it onto the ORF stack.
}
}
}
@orfs || die "No ORFs of $min_length nucleotide in length found";
#Loop to BLAST each ORF against the database, and check for a hit.
my $BLAST_count;
for my $orf (@orfs) {
#Assign ORF specific variables.
my $begin = $$orf{begin};
my $end = $$orf{end};
my $frame = $$orf{frame};
$BLAST_count++;
#Initialize subsequence as new sequence
my $seq = new Bio::PrimarySeq
(-seq => $seq_obj->subseq($begin + 1, $end),
-display_id => "${frame}_${begin}_${end}");
#Translate sequence
my $trans = $seq->translate();
#Blast the sequence against a database:
my $job = $BLAST_factory->submit_blast($trans);
print STDERR "Blasting ORF ",$BLAST_count," of ", scalar @orfs, "...";
#Loop to load the RIDs returned for the BLAST job submitted (this probably
#doesn't need to be a loop here but I won't take it out yet)
while ( my @rids = $BLAST_factory->each_rid ) {
#Loop iterate over RIDs, and hit NCBI's BLAST server for a result
foreach my $rid ( @rids ) {
#Hit the server for a result on RID.
my $blast_results = $BLAST_factory->retrieve_blast($rid);
#Was a result returned?
if( !ref($blast_results) ) {
#If so and it returned an error remove that RID from the stack
if ($blast_results < 0) {
$BLAST_factory->remove_rid($rid);
}
print STDERR "."; #Keep the user staring at the dots.
sleep 5; #Plays nice with the servers.
}
#If a result was returned and it isn't an error, then pass it to a
#variable...
else {
my $result = $blast_results->next_result();
$BLAST_factory->remove_rid($rid); #...and remove it's RID from the stack.
#Check the result for a hit...
my $hit = $result->next_hit;
if (ref($hit)) {
my $hsp = $hit->next_hsp;
#...collect it's evalue from the hsp object, and add to the ORFs hash
$$orf{evalue} = $hsp->evalue();
}
#If no evalue found, default to 100 to keep undef from looking like a
#significant e-value.
else {$$orf{evalue} = 100}
print "\n";
}
}
}
}
#Main block to draw image.
my $image = new GD::Image(900, 150); #Create a new image.
#Allocate some colors.
my %color = (
white => $image->colorAllocate(255,255,255),
aqua => $image->colorAllocate(0,255,255),
black => $image->colorAllocate(0,0,0),
blue => $image->colorAllocate(0,0,255),
gray => $image->colorAllocate(128,128,128),
fuchsia => $image->colorAllocate(255,0,255),
green => $image->colorAllocate(0,255,0),
lime => $image->colorAllocate(0,255,255),
maroon => $image->colorAllocate(128,0,0),
navy => $image->colorAllocate(0,0,128),
olive => $image->colorAllocate(128,128,0),
purple => $image->colorAllocate(128,0,128),
red => $image->colorAllocate(255,0,0),
silver => $image->colorAllocate(192,192,192),
teal => $image->colorAllocate(0,128,128),
yellow1 => $image->colorAllocate(255,255,0),
yellow2 => $image->colorAllocate(200,200,0),
yellow3 => $image->colorAllocate(150,150,0)
);
#Make the background transparent and interlaced.
$image->transparent($color{white});
$image->interlaced('true');
#Put a black frame around the picture.
$image->rectangle(0,0,899,149,$color{black});
#Add the title line.
$image->string(gdGiantFont,10,10,$header,$color{black});
#Draw the lines for each frame.
$image->line(10,50,890,50,$color{black});
$image->line(10,75,890,75,$color{black});
$image->line(10,100,890,100,$color{black});
#Draw a line for the ruler.
$image->line(10,125,890,125,$color{black});
#Loop to add ruler ticks and numbers to image.
for my $tick (0 .. 10) {
#Convert sequence coordniates to image X-asix values.
$tick = $sequence_length/10*$tick;
$tick = convert($tick, $sequence_length);
#Add ruler ticks.
$image->line($tick,125,$tick,130,$color{black});
#Add nubmers to ruler.
$image->string(gdSmallFont,$tick-(2*length($tick-15)),130,$tick-15,$color{black});
}
#Loop to add ORFs to image.
for my $orf (@orfs) {
my $top; #The variable sets the top of ORF rectangle.
my $bottom; #This varibale sets the bottom of ORF rectangle.
#Asign the Y coordinates for the ORF to place them in the correct frame.
if ($$orf{frame} == 1) {$top = 40; $bottom = 60}
elsif ($$orf{frame} == 2) {$top = 65; $bottom = 85}
elsif ($$orf{frame} == 3) {$top = 90; $bottom = 110}
#Convert sequence coordniates to image X-axis values.
my $begin = convert($$orf{begin}, $sequence_length);
my $end = convert($$orf{end}, $sequence_length);
#Asign a shade of yellow to the ORF if the BLAST on that ORF returned an
#evaule.
my $orf_color = $color{black}; #Default ORF color to black.
if (defined $$orf{evalue}) {
if ($$orf{evalue} <= 10) {$orf_color = $color{yellow3}} #Dark yellow
if ($$orf{evalue} <= 1.0e-3) {$orf_color = $color{yellow2}} #Meduim yellow
if ($$orf{evalue} < 1.0e-25) {$orf_color = $color{yellow1}} #Bright yellow
}
#Draw rectangles for the ORFs.
$image->filledRectangle($begin,$top,$end,$bottom,$orf_color);
#Print the e-value on the ORF if it is below 10.
if ($$orf{evalue} < 10) {
$image->string(gdSmallFont,$begin + 3,$top + 2,$$orf{evalue},$color{black});
}
}
#Add green ticks to the image for each start.
for my $start (@starts) {
my $top; #This variable sets the top of the start line.
my $bottom; #This varibale sets the bottom of the start line.
#Assign the Y coordinates for the start line to put it in the correct frame.
if ($$start{frame} == 1) {$top = 50; $bottom = 60}
elsif ($$start{frame} == 2) {$top = 75; $bottom = 85}
elsif ($$start{frame} == 3) {$top = 100; $bottom = 110}
#Convert sequence coordniates to image X-axis values.
my $location = convert($$start{start}, $sequence_length);
#Draw the start ticks.
$image->line($location,$top,$location,$bottom,$color{green});
}
#Add red ticks to the image for each stop.
for my $stop (@stops) {
my $top; #This variable sets the top of the stop line.
my $bottom; #This varibale sets the bottom of the stop line.
#Assign the Y coordinates for the stop line to put it in the correct frame.
if ($$stop{frame} == 1) {$top = 40; $bottom = 60}
elsif ($$stop{frame} == 2) {$top = 65; $bottom = 85}
elsif ($$stop{frame} == 3) {$top = 90; $bottom = 110}
#Convert sequence coordniates to image X-asix values.
my $location = convert($$stop{stop}, $sequence_length);
#Draw the stop ticks.
$image->line($location,$top,$location,$bottom,$color{red});
}
#Set a default output filename if none was set on the command line.
if (! $out_filename) {
if ( $in_filename &&=~ /(.*?)\..*/) {
$out_filename = $1.".".$format;
}
elsif ( $seq_obj->primary_id !~ /unknown/) {
$out_filename = $seq_obj->display_name().".".$format;
}
else {
$out_filename = $seq_obj->primary_id().".".$format;
}
}
#Open a filehandle for output and make sure we are writing a binary stream.
open (OUT, ">$out_filename");
binmode OUT;
#Write the image to a file in specified format.
if ($format =~ /jpg|jpeg/) {
print OUT $image->jpeg;
}
if ($format =~ /png/) {
print OUT $image->png;
}
close OUT;
}
#A subroutine to convert sequence coordinates to x-axis values on the image.
sub convert {
my ($value, $length) = @_;
$value = (($value/$length)*870)+15; #Convert a sequence length value to
an X-axis value.
$value = sprintf("%.0f", $value); #Round $value to nearest integer.
return $value;
}
=head1 NAME
ORF_BLAST1.pl
=head1 SYNOPSIS
perl ORF_Plot.pl [--options] NM_007327
=head1 DESCRIPTION
This program will take a sequence file as input, and generate a
graphical output
of it's ORF architecture in 3 frames plotting ORFs, start codons (ATG)
and stop
codons. It will BLAST the translation of each ORF against NCBI, and
color the
shades of yellow to black, depending on the e-value returned for that ORF.
INPUT:
Input can be a list of space seperated accession numbers on the command
line,
or a fasta file.
OUTPUT:
Output is a figure saved as either a png or jpg file to the current
directory.
OPTIONS:
Several options can be specified, but all are optional.
--in_file filename
Use to set the input file name. The file that contains the input
sequences in fasta format.
--out_file filename
Use to set the output file name. Defaults to input file name, then
Bioperl's display name (usually the accession number), then Bioperl's
accession number (usually the gi number).
--min_size integer
Use to set the minimum ORF size that will be plotted in the figure.
--start
Use to require plotted ORFs to begin with a start
--format
Use to set the output format. Valid values are png or jpg (or jpeg).
Defaults to jpg.
=head1 USING
perl ORF_BLAST1.pl --min_size 300 --start --format png NM_001112
NM_007327 NM_015833
or
perl ORF_BLAST1.pl --in_file sequence.fasta --out_file image_file
--min_size 300 --start
=head1 REQUIRES
GD
Getopt::Long
Bio::SeqIO
Bio::PrimarySeq
Bio::DB::GenBank
Bio::Tools::Run::RemoteBlast
=head1 AUTHOR
Barry Moore
Department of Human Genetics
University of Utah
Salt Lake City, UT 84112
USA
Address bug reports and comments to: barry.moore at genetics.utah.edu
=head1 BUGS
Currently after about 50 ORFs BLASTed, NCBI starts to return time-out
errors.
=head1 FUTURE DIRECTIONS
Add command line options for the BLAST parameters.
=head1 COPYRIGHT
Copyright 2003, Barry Moore. All rights reserved.
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
=head1 SEE ALSO
=cut
More information about the Bioperl-l
mailing list