[Bioperl-l] Problems parsing blast reports

Dave Messina David.Messina at sbc.su.se
Mon May 16 12:40:57 UTC 2011


Hi Lorenzo,

Please remember to reply all so the mailing list sees the whole discussion.

I looked at your code and used your data, and the reason you're not always
getting the best hit in your output is that subsequent hits which also pass
your filter are writing over the best hit in your hash. So it's not a random
hit that's getting saved, it's the last one that matched your criteria.

To fix this, you need to short-circuit out of the loop once you've found a
hit matching your criteria.

Label the next_result loop like:

    RESULT: while ( my $result = $in->next_result ) {

and then have the last line inside your filtering if block be

    next RESULT;

Assuming "best hit" to you means best e-value, the above works since hits
are sorted by best to worst e-value.


I'd also like to get the queries only returning itself as hits (singletons)
> but I must make run besthits properly before (any suggestion?).


For this, I'd suggest that you actually save the best hit regardless of
whether it's a self-match, and then if there's a hit passing your criteria
that isn't a self-match, you test for that and replace the previously saved
self-match in %besthits. (Again, short-circuiting out of the loop once
you've got that hit.)


For the moment I'm only testing bitscore and thresholdevalue so I don't need
> to loop over hsp (do I?).


No, remember each line in the -m8 output is an hsp, not a hit. So if you
only loop over the hits, you'll only see the first hsp. In your example
dataset, there weren't any with multiple hsps, so that's why you didn't run
into this problem.


Dave




On Sat, May 14, 2011 at 22:17, Lorenzo Carretero <lcpaulet at googlemail.com>wrote:

> Hi David,
> Thanks for your reply.
> My scripts is within a subroutine within a .pm file called from a .pl
> script. I checked the arguments are being passed correctly. For the moment
> I'm only testing bitscore and thresholdevalue so I don't need to loop over
> hsp (do I?). Of course, I'm using the strict and warning pragmas and use
> Bio::SearchIO; I tried with a reduced version of my BLASTP output (attached
> as a file). This is what I get in %besthits:
>
>> gnl|Alyrata|AL2G21220,gnl|Alyrata|AL6G21100
>> gnl|Alyrata|AL0G11620,gnl|Alyrata|AL1G37580
>> gnl|Alyrata|AL6G05070,gnl|Alyrata|AL3G15690
>> gnl|Alyrata|AL2G12090,gnl|Alyrata|AL4G18800
>> gnl|Alyrata|AL1G15460,gnl|Alyrata|AL0G01870
>>
>
> My script only returns the correct hits for AL6G05070 and AL0G11260, while
> the 3d for AL2G21220, the 17th for AL2G12090 and the 13th for AL1G15460.
> Note that the best non-self hit for AL2G12090 is not the second one but the
> first as it shows better score than the self-hit. I'd also like to get the
> queries only returning itself as hits (singletons) but I must make run
> besthits properly before (any suggestion?).
> Thanks again,
> Lorenzo
>
> On Sat, May 14, 2011 at 6:15 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>
>> Hi Lorenzo,
>>
>> Hmm, I tried your code, and it worked for me.
>>
>> I set your input variables set as follows:
>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>> $minimumidentity,
>>     $maximumredundancy, $minimumalnlength )
>>   = ('2008.blasttable', 'blasttable', 300, '1e-100', 10, 0, 10);
>>
>> where the file is t/data/2008.blasttable that comes with the BioPerl
>> distro. With those settings, the top hit goes into %besthits and the third
>> hit goes into %diverged.
>>
>> I also added at the top:
>> use strict;
>> use warnings;
>> use Bio::SearchIO;
>>
>> You *are* using strict and warnings, aren't you? :)
>>
>> One thing that may be an issue is that you're doing this at the hit level,
>> and remember that in -m8 format each line represents an HSP.
>>
>> I'd need your input — really, a neat little test case and the
>> corresponding erroneous output — to help further.
>>
>> Dave
>>
>>
>>
>> On Sat, May 14, 2011 at 17:29, Lorenzo Carretero <lcpaulet at googlemail.com
>> > wrote:
>>
>>> Hi all,
>>> I'm trying to parse blasttable '-m 8 format' reports from whole
>>> intra-genome
>>> comparisons of all vs all to get the best non-self hit and diverged best
>>> non-self hit as separate hashes (key=query=>value=non_self_hit). The
>>> following script seems to run ok but it returns unexpected results (i.e.
>>> it
>>> doesn't catch the best hit but a random hit apparently). I assume hits
>>> are
>>> iterated over the while loop (while (my $hit = $result->next_hit) as
>>> returned in the blast results (i.e. sorted by hits bit scores).
>>> Any help would be much appreciated.
>>> Cheers,
>>> Lorenzo
>>>
>>>
>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>> > $minimumidentity, $maximumredundancy, $minimumalnlength ) = @_;
>>> > my ( %besthits, %diverged, %redundant ,%genefusion ) = ();
>>> > my ( $refbh, $refdiv, $cb, $cd, $refred, $reffus, $cr, $cf);
>>> > my $total = 0;
>>> > my $in = new Bio::SearchIO (     -file   => $filename,
>>> >                                  -format => $format,
>>> >                                  #-verbose => -1,
>>> >                                   )
>>> >                                  or die "No $filename BLAST file with
>>> > $format found";
>>> >       while( my $result = $in->next_result)    {
>>> >       $total++;
>>> >           while (my $hit = $result->next_hit)    {
>>> >           my $query = $result->query_name();
>>> >           my $hitname = $hit->name();
>>> >           my $bits = $hit->bits();
>>> >           my $evalue = $hit->significance();
>>> >             if ($query ne $hitname and $bits >= $minimumbitsscore and
>>> > $evalue <= $thresholdevalue )    {
>>> >                 $besthits{ $query } = $hitname;
>>> >             }
>>> >             elsif    ( $bits <= $minimumbitsscore and $evalue >=
>>> > $thresholdevalue){
>>> >                 $diverged { $query } = $hitname;
>>> >             }
>>> >
>>> > #                while( my $hsp = $hit->next_hsp )    {
>>> > #                my $querylen = $hsp->length( 'query' );
>>> > #                my $hitlen = $hsp->length( 'hit' );
>>> > #                my $alnlen = $hsp->length( 'total' );
>>> > #                my $identity = $hsp->percent_identity();
>>> > #                    if ( $identity > $maximumredundancy )    {
>>> > #                    $redundant{ $query } = $hitname;
>>> > #                    }
>>> > #                    elsif ( ($querylen <= ($alnlen / 2) ) || ($hitlen
>>> <=
>>> > ($alnlen / 2) ) || ($alnlen <= $minimumalnlength) )    {
>>> > #                    $genefusion{ $query } = $hitname;
>>> > #                    }
>>> > #                    elsif ( ($evalue >= $thresholdevalue) ||
>>> ($identity <=
>>> > $minimumidentity) || ($bits <= $minimumbitsscore) )    {
>>> > #                    $diverged{ $query } = $hitname;
>>> > #                    }
>>> > #                    else    {
>>> > #                    $besthits{ $query } = $hitname;
>>> > #                    }
>>> > #                }
>>> >             }
>>> >     }
>>> > $refbh = \%besthits;
>>> > $refdiv = \%diverged;
>>> > $refred = \%redundant;
>>> > $reffus = \%genefusion;
>>> > $cb = keys ( %besthits );
>>> > $cd = keys( %diverged );
>>> > $cr = keys( %redundant );
>>> > $cf = keys( %genefusion );
>>> >
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>




More information about the Bioperl-l mailing list