[Bioperl-l] help using SEARCH IO to parse blast results

Chris Fields cjfields at uiuc.edu
Wed Nov 28 13:57:27 UTC 2007


I had some code which does this which I committed yesterday to CVS; it  
catches the GI for the query and the hits:

$result->query_gi;
$hit->ncbi_gi;

I am in the midst of fixing additional problems with WU-BLAST parsing  
but you are more than welcome to try it.

chris

On Nov 27, 2007, at 3:32 PM, alison waller wrote:

> Thanks Everyone,
>
> Your edits worked Dave, however after looking at the output I  
> realized that
> I only want information on the top hsp per query returned.  For  
> example some
> of the querys the top hit has two hsps so it returned both.
>
> I tried to further edit it, but after 3 attempts they are all  
> failing, I
> think due to me using the loops wrong.
>
> I also have another problem, I also want to retrieve the gi, this  
> doesn't
> seem to be straight forward as it should.  I found another method
> _get_seq_identifiers, but this looks awkward, isn't there and object  
> for the
> gi?
>
> I've pasted my non-working script below if there are any suggestions  
> on how
> to get it to print out just the first hsp per hit, that would be  
> great.
>
> Thanks,
>
>
> #!/usr/local/bin/perl -w
>
>
> # Parsing BLAST reports with BioPerl's Bio::SearchIO module
> # alison waller November 2007
>
>
> use strict;
> use warnings;
> use Bio::SearchIO;
>
>
> my $usage = "to run type: blast_parse_aw.pl <blast report> <# of  
> hits>\n";
> if (@ARGV != 2) { die $usage; }
>
>
> my $infile  = $ARGV[0];
> my $outfile = $infile . '.parsed';
> my $tophit  = $ARGV[1]; # to specify in the command line how many hits
>                        # to report for each query
>
>
> #open(  IN, $infile )     || die "Can't open inputfile $infile! $!\n";
> open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile! $! 
> \n";
>
>
> my $report = new Bio::SearchIO(
>    -file   => "$infile",
>    -format => "blast"
> );
>
>
> print OUT
>
> "Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent 
> \tgaps\t
> strand\tHstrand\n";
>
>
> # Go through BLAST reports one by one
> while (my $result = $report->next_result) {
> 	my $i=0;
> 	while( ( $i++<$tophit) && (my $hit = $result->next_hit)){
>    	while (  ( $i++ < $tophit ) && (my $hsp = $hit->next_hsp) ) {
>
>
>            # Print some tab-delimited data about this hit
>            print OUT $result->query_name,     "\t";
>            print OUT $hit->name,              "\t";
>            print OUT $hit->significance,      "\t";
>            print OUT $hit->bits,              "\t";
>            print OUT $hsp->evalue,            "\t";
>            print OUT $hsp->percent_identity,  "\t";
>            print OUT $hsp->length('total'),   "\t";
>            print OUT $hsp->num_identical,     "\t";
>            print OUT $hsp->gaps,              "\t";
>            print OUT $hsp->strand('query'),   "\t";
>            print OUT $hsp->strand('hit'),     "\n";
>        }
> }
>    if ($i == 0) { print OUT "no hits\n"; }
>
> }
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at uiuc.edu]
> Sent: Tuesday, November 27, 2007 4:01 PM
> To: Smithies, Russell
> Cc: Dave Messina; alison waller; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results
>
> The hits/HSPs are generally in the order they appear in the report.
>
> If you are looking for best/worst HSP after parsing you can use the
> $hit->hsp() method:
>
> # best and worst
> my $best = $hit->hsp('best'); # also 'first'
> my $worst = $hit->hsp('worst'); # also last
>
> The SearchIO text BLAST parser also has several options implemented
> for finer control:
>
>     -inclusion_threshold => e-value threshold for inclusion in the
>                             PSI-BLAST score matrix model (blastpgp)
>     -signif      => float or scientific notation number to be used
>                     as a P- or Expect value cutoff
>     -score       => integer or scientific notation number to be used
>                     as a blast score value cutoff
>     -bits        => integer or scientific notation number to be used
>                     as a bit score value cutoff
>     -hit_filter  => reference to a function to be used for
>                     filtering hits based on arbitrary criteria.
>                     All hits of each BLAST report must satisfy
>                     this criteria to be retained.
>                     If a hit fails this test, it is ignored.
>                     This function should take a
>                     Bio::Search::Hit::BlastHit.pm object as its first
>                     argument and return true
>                     if the hit should be retained.
>                     Sample filter function:
>                        -hit_filter => sub { $hit = shift;
>                                             $hit->gaps == 0; },
>                     (Note: -filt_func is synonymous with -hit_filter)
>     -overlap     => integer. The amount of overlap to permit between
>                     adjacent HSPs when tiling HSPs. A reasonable
> value is 2.
>                     Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
>
> chris
>
> On Nov 27, 2007, at 1:31 PM, Smithies, Russell wrote:
>
>> Do the hits need to be sorted first or is this done automagicly?
>> I ask this as I know Megablast doesn't provide sorted output for
>> most of
>> it's formats.
>>
>> Russell
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-
>>> bio.org] On Behalf Of Dave Messina
>>> Sent: Wednesday, 28 November 2007 6:56 a.m.
>>> To: alison waller
>>> Cc: bioperl-l at lists.open-bio.org
>>> Subject: Re: [Bioperl-l] help using SEARCH IO to parse blast results
>>>
>>> Hi Alison,
>>> As Sendu mentioned, the key bit is adding a condition to the hit  
>>> loop
>> to
>>> limit the number of hits that are printed. I didn't test the below
>>> extensively, but give it a try...
>>>
>>>
>>> Dave
>>>
>>>
>>>
>>> #!/usr/local/bin/perl -w
>>>
>>> # Parsing BLAST reports with BioPerl's Bio::SearchIO module
>>> # alison waller November 2007
>>>
>>> use strict;
>>> use warnings;
>>> use Bio::SearchIO;
>>>
>>> my $usage = "to run type: blast_parse_aw.pl <blast report> <# of
>> hits>\n";
>>> if (@ARGV != 2) { die $usage; }
>>>
>>> my $infile  = $ARGV[0];
>>> my $outfile = $infile . '.parsed';
>>> my $tophit  = $ARGV[1]; # to specify in the command line how many
>>> hits
>>>                       # to report for each query
>>>
>>> #open(  IN, $infile )     || die "Can't open inputfile $infile! $!
>>> \n";
>>> open( OUT, ">$outfile" ) || die "Can't open outputfile $outfile!
>> $!\n";
>>>
>>> my $report = new Bio::SearchIO(
>>>   -file   => "$infile",
>>>   -format => "blast"
>>> );
>>>
>>> print OUT
>>>
>> "Query\tHitDesc\tHitSignif\tHitBits\tEvalue\t%id\tAlignLen\tNumIdent
>> \tga
>> ps\t
>>> Qstrand\tHstrand\n";
>>>
>>> # Go through BLAST reports one by one
>>> while ( my $result = $report->next_result ) {
>>>   my $i = 0;
>>>   while (  ( $i++ < $tophit ) && (my $hit = $result->next_hit) ) {
>>>       while ( my $hsp = $hit->next_hsp ) {
>>>
>>>           # Print some tab-delimited data about this hit
>>>           print OUT $result->query_name,     "\t";
>>>           print OUT $hit->name,              "\t";
>>>           print OUT $hit->significance,      "\t";
>>>           print OUT $hit->bits,              "\t";
>>>           print OUT $hsp->evalue,            "\t";
>>>           print OUT $hsp->percent_identity,  "\t";
>>>           print OUT $hsp->length('total'),   "\t";
>>>           print OUT $hsp->num_identical,     "\t";
>>>           print OUT $hsp->gaps,              "\t";
>>>           print OUT $hsp->strand('query'),   "\t";
>>>           print OUT $hsp->strand('hit'),     "\n";
>>>       }
>>>   }
>>>
>>>   if ($i == 0) { print OUT "no hits\n"; }
>>> }
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> =
>> = 
>> =====================================================================
>> Attention: The information contained in this message and/or
>> attachments
>> from AgResearch Limited is intended only for the persons or entities
>> to which it is addressed and may contain confidential and/or
>> privileged
>> material. Any review, retransmission, dissemination or other use of,
>> or
>> taking of any action in reliance upon, this information by persons or
>> entities other than the intended recipients is prohibited by
>> AgResearch
>> Limited. If you have received this message in error, please notify  
>> the
>> sender immediately.
>> =
>> = 
>> =====================================================================
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






More information about the Bioperl-l mailing list