[Bioperl-l] help using SEARCH IO to parse blast results

alison waller alison.waller at utoronto.ca
Tue Nov 27 21:32:07 UTC 2007


Thanks Everyone,

Your edits worked Dave, however after looking at the output I realized that
I only want information on the top hsp per query returned.  For example some
of the querys the top hit has two hsps so it returned both.

I tried to further edit it, but after 3 attempts they are all failing, I
think due to me using the loops wrong.

I also have another problem, I also want to retrieve the gi, this doesn't
seem to be straight forward as it should.  I found another method
_get_seq_identifiers, but this looks awkward, isn't there and object for the
gi?

I've pasted my non-working script below if there are any suggestions on how
to get it to print out just the first hsp per hit, that would be great.

Thanks,


#!/usr/local/bin/perl -w


# Parsing BLAST reports with BioPerl's Bio::SearchIO module 
# alison waller November 2007


use strict;
use warnings;
use Bio::SearchIO;


my $usage = "to run type: blast_parse_aw.pl <blast report> <# of hits>\n";
if (@ARGV != 2) { die $usage; }


my $infile  = $ARGV[0];
my $outfile = $infile . '.parsed';
my $tophit  = $ARGV[1]; # to specify in the command line how many hits
                        # to report for each query


#open(  IN, $infile )     || die "Can't open inputfile $infile! $!\n";
open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile! $!\n";


my $report = new Bio::SearchIO(
    -file   => "$infile",
    -format => "blast"
);


print OUT
 
"Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent\tgaps\t
strand\tHstrand\n";


# Go through BLAST reports one by one
while (my $result = $report->next_result) {
	my $i=0;
	while( ( $i++<$tophit) && (my $hit = $result->next_hit)){
    	while (  ( $i++ < $tophit ) && (my $hsp = $hit->next_hsp) ) {
        

            # Print some tab-delimited data about this hit
            print OUT $result->query_name,     "\t";
            print OUT $hit->name,              "\t"; 
            print OUT $hit->significance,      "\t";
            print OUT $hit->bits,              "\t";
            print OUT $hsp->evalue,            "\t"; 
            print OUT $hsp->percent_identity,  "\t";
            print OUT $hsp->length('total'),   "\t";
            print OUT $hsp->num_identical,     "\t"; 
            print OUT $hsp->gaps,              "\t";
            print OUT $hsp->strand('query'),   "\t";
            print OUT $hsp->strand('hit'),     "\n"; 
        }
}
    if ($i == 0) { print OUT "no hits\n"; } 

}

-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu] 
Sent: Tuesday, November 27, 2007 4:01 PM
To: Smithies, Russell
Cc: Dave Messina; alison waller; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results

The hits/HSPs are generally in the order they appear in the report.

If you are looking for best/worst HSP after parsing you can use the  
$hit->hsp() method:

# best and worst
my $best = $hit->hsp('best'); # also 'first'
my $worst = $hit->hsp('worst'); # also last

The SearchIO text BLAST parser also has several options implemented  
for finer control:

     -inclusion_threshold => e-value threshold for inclusion in the
                             PSI-BLAST score matrix model (blastpgp)
     -signif      => float or scientific notation number to be used
                     as a P- or Expect value cutoff
     -score       => integer or scientific notation number to be used
                     as a blast score value cutoff
     -bits        => integer or scientific notation number to be used
                     as a bit score value cutoff
     -hit_filter  => reference to a function to be used for
                     filtering hits based on arbitrary criteria.
                     All hits of each BLAST report must satisfy
                     this criteria to be retained.
                     If a hit fails this test, it is ignored.
                     This function should take a
                     Bio::Search::Hit::BlastHit.pm object as its first
                     argument and return true
                     if the hit should be retained.
                     Sample filter function:
                        -hit_filter => sub { $hit = shift;
                                             $hit->gaps == 0; },
                     (Note: -filt_func is synonymous with -hit_filter)
     -overlap     => integer. The amount of overlap to permit between
                     adjacent HSPs when tiling HSPs. A reasonable  
value is 2.
                     Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.

chris

On Nov 27, 2007, at 1:31 PM, Smithies, Russell wrote:

> Do the hits need to be sorted first or is this done automagicly?
> I ask this as I know Megablast doesn't provide sorted output for  
> most of
> it's formats.
>
> Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-
>> bio.org] On Behalf Of Dave Messina
>> Sent: Wednesday, 28 November 2007 6:56 a.m.
>> To: alison waller
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results
>>
>> Hi Alison,
>> As Sendu mentioned, the key bit is adding a condition to the hit loop
> to
>> limit the number of hits that are printed. I didn't test the below
>> extensively, but give it a try...
>>
>>
>> Dave
>>
>>
>>
>> #!/usr/local/bin/perl -w
>>
>> # Parsing BLAST reports with BioPerl's Bio::SearchIO module
>> # alison waller November 2007
>>
>> use strict;
>> use warnings;
>> use Bio::SearchIO;
>>
>> my $usage = "to run type: blast_parse_aw.pl <blast report> <# of
> hits>\n";
>> if (@ARGV != 2) { die $usage; }
>>
>> my $infile  = $ARGV[0];
>> my $outfile = $infile . '.parsed';
>> my $tophit  = $ARGV[1]; # to specify in the command line how many  
>> hits
>>                        # to report for each query
>>
>> #open(  IN, $infile )     || die "Can't open inputfile $infile! $! 
>> \n";
>> open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile!
> $!\n";
>>
>> my $report = new Bio::SearchIO(
>>    -file   => "$infile",
>>    -format => "blast"
>> );
>>
>> print OUT
>>
> "Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent 
> \tga
> ps\t
>> Qstrand\tHstrand\n";
>>
>> # Go through BLAST reports one by one
>> while ( my $result = $report->next_result ) {
>>    my $i = 0;
>>    while (  ( $i++ < $tophit ) && (my $hit = $result->next_hit) ) {
>>        while ( my $hsp = $hit->next_hsp ) {
>>
>>            # Print some tab-delimited data about this hit
>>            print OUT $result->query_name,     "\t";
>>            print OUT $hit->name,              "\t";
>>            print OUT $hit->significance,      "\t";
>>            print OUT $hit->bits,              "\t";
>>            print OUT $hsp->evalue,            "\t";
>>            print OUT $hsp->percent_identity,  "\t";
>>            print OUT $hsp->length('total'),   "\t";
>>            print OUT $hsp->num_identical,     "\t";
>>            print OUT $hsp->gaps,              "\t";
>>            print OUT $hsp->strand('query'),   "\t";
>>            print OUT $hsp->strand('hit'),     "\n";
>>        }
>>    }
>>
>>    if ($i == 0) { print OUT "no hits\n"; }
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> = 
> ======================================================================
> Attention: The information contained in this message and/or  
> attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or  
> privileged
> material. Any review, retransmission, dissemination or other use of,  
> or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by  
> AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> = 
> ======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign







More information about the Bioperl-l mailing list