[Bioperl-l] help using SEARCH IO to parse blast results
alison waller
alison.waller at utoronto.ca
Tue Nov 27 21:32:07 UTC 2007
Thanks Everyone,
Your edits worked Dave, however after looking at the output I realized that
I only want information on the top hsp per query returned. For example some
of the querys the top hit has two hsps so it returned both.
I tried to further edit it, but after 3 attempts they are all failing, I
think due to me using the loops wrong.
I also have another problem, I also want to retrieve the gi, this doesn't
seem to be straight forward as it should. I found another method
_get_seq_identifiers, but this looks awkward, isn't there and object for the
gi?
I've pasted my non-working script below if there are any suggestions on how
to get it to print out just the first hsp per hit, that would be great.
Thanks,
#!/usr/local/bin/perl -w
# Parsing BLAST reports with BioPerl's Bio::SearchIO module
# alison waller November 2007
use strict;
use warnings;
use Bio::SearchIO;
my $usage = "to run type: blast_parse_aw.pl <blast report> <# of hits>\n";
if (@ARGV != 2) { die $usage; }
my $infile = $ARGV[0];
my $outfile = $infile . '.parsed';
my $tophit = $ARGV[1]; # to specify in the command line how many hits
# to report for each query
#open( IN, $infile ) || die "Can't open inputfile $infile! $!\n";
open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile! $!\n";
my $report = new Bio::SearchIO(
-file => "$infile",
-format => "blast"
);
print OUT
"Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent\tgaps\t
strand\tHstrand\n";
# Go through BLAST reports one by one
while (my $result = $report->next_result) {
my $i=0;
while( ( $i++<$tophit) && (my $hit = $result->next_hit)){
while ( ( $i++ < $tophit ) && (my $hsp = $hit->next_hsp) ) {
# Print some tab-delimited data about this hit
print OUT $result->query_name, "\t";
print OUT $hit->name, "\t";
print OUT $hit->significance, "\t";
print OUT $hit->bits, "\t";
print OUT $hsp->evalue, "\t";
print OUT $hsp->percent_identity, "\t";
print OUT $hsp->length('total'), "\t";
print OUT $hsp->num_identical, "\t";
print OUT $hsp->gaps, "\t";
print OUT $hsp->strand('query'), "\t";
print OUT $hsp->strand('hit'), "\n";
}
}
if ($i == 0) { print OUT "no hits\n"; }
}
-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Tuesday, November 27, 2007 4:01 PM
To: Smithies, Russell
Cc: Dave Messina; alison waller; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results
The hits/HSPs are generally in the order they appear in the report.
If you are looking for best/worst HSP after parsing you can use the
$hit->hsp() method:
# best and worst
my $best = $hit->hsp('best'); # also 'first'
my $worst = $hit->hsp('worst'); # also last
The SearchIO text BLAST parser also has several options implemented
for finer control:
-inclusion_threshold => e-value threshold for inclusion in the
PSI-BLAST score matrix model (blastpgp)
-signif => float or scientific notation number to be used
as a P- or Expect value cutoff
-score => integer or scientific notation number to be used
as a blast score value cutoff
-bits => integer or scientific notation number to be used
as a bit score value cutoff
-hit_filter => reference to a function to be used for
filtering hits based on arbitrary criteria.
All hits of each BLAST report must satisfy
this criteria to be retained.
If a hit fails this test, it is ignored.
This function should take a
Bio::Search::Hit::BlastHit.pm object as its first
argument and return true
if the hit should be retained.
Sample filter function:
-hit_filter => sub { $hit = shift;
$hit->gaps == 0; },
(Note: -filt_func is synonymous with -hit_filter)
-overlap => integer. The amount of overlap to permit between
adjacent HSPs when tiling HSPs. A reasonable
value is 2.
Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
chris
On Nov 27, 2007, at 1:31 PM, Smithies, Russell wrote:
> Do the hits need to be sorted first or is this done automagicly?
> I ask this as I know Megablast doesn't provide sorted output for
> most of
> it's formats.
>
> Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-
>> bio.org] On Behalf Of Dave Messina
>> Sent: Wednesday, 28 November 2007 6:56 a.m.
>> To: alison waller
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results
>>
>> Hi Alison,
>> As Sendu mentioned, the key bit is adding a condition to the hit loop
> to
>> limit the number of hits that are printed. I didn't test the below
>> extensively, but give it a try...
>>
>>
>> Dave
>>
>>
>>
>> #!/usr/local/bin/perl -w
>>
>> # Parsing BLAST reports with BioPerl's Bio::SearchIO module
>> # alison waller November 2007
>>
>> use strict;
>> use warnings;
>> use Bio::SearchIO;
>>
>> my $usage = "to run type: blast_parse_aw.pl <blast report> <# of
> hits>\n";
>> if (@ARGV != 2) { die $usage; }
>>
>> my $infile = $ARGV[0];
>> my $outfile = $infile . '.parsed';
>> my $tophit = $ARGV[1]; # to specify in the command line how many
>> hits
>> # to report for each query
>>
>> #open( IN, $infile ) || die "Can't open inputfile $infile! $!
>> \n";
>> open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile!
> $!\n";
>>
>> my $report = new Bio::SearchIO(
>> -file => "$infile",
>> -format => "blast"
>> );
>>
>> print OUT
>>
> "Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent
> \tga
> ps\t
>> Qstrand\tHstrand\n";
>>
>> # Go through BLAST reports one by one
>> while ( my $result = $report->next_result ) {
>> my $i = 0;
>> while ( ( $i++ < $tophit ) && (my $hit = $result->next_hit) ) {
>> while ( my $hsp = $hit->next_hsp ) {
>>
>> # Print some tab-delimited data about this hit
>> print OUT $result->query_name, "\t";
>> print OUT $hit->name, "\t";
>> print OUT $hit->significance, "\t";
>> print OUT $hit->bits, "\t";
>> print OUT $hsp->evalue, "\t";
>> print OUT $hsp->percent_identity, "\t";
>> print OUT $hsp->length('total'), "\t";
>> print OUT $hsp->num_identical, "\t";
>> print OUT $hsp->gaps, "\t";
>> print OUT $hsp->strand('query'), "\t";
>> print OUT $hsp->strand('hit'), "\n";
>> }
>> }
>>
>> if ($i == 0) { print OUT "no hits\n"; }
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> =
> ======================================================================
> Attention: The information contained in this message and/or
> attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or
> privileged
> material. Any review, retransmission, dissemination or other use of,
> or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by
> AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =
> ======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list