[Bioperl-l] Problems parsing blast reports

Lorenzo Carretero lcpaulet at googlemail.com
Mon May 16 19:56:29 UTC 2011


Hi Dave,
The following code do the job as far as I don't loop through hsps. I tested
it using some results showing several hsps and gives the expected results.
Hsps for a given hit are also ordered by score so is it really necessary to
loop?.
Best,
Lorenzo
PS: Sorry, I forgot to reply all in my last message

while( my $result = $in->next_result)
>       {
>       $total++;
>       my $query = $result->query_name();
>           while (my $hit = $result->next_hit)
>           {
>               my $hitname = $hit->name();
>             my $bits = $hit->bits();
>             my $evalue = $hit->significance();
>                 if ( $result->num_hits == 1 and $query eq $hitname )
>                 {
>                 $cs++;
>                 $singletons{ $cs } = "$query;$hitname";
>                 last;
>                 }
>                 if ( $query ne $hitname and $bits >= $minimumbitsscore and
> $evalue <= $thresholdevalue )
>                 {
>                 $cb++;
>                 $besthits{ $cb } = "$query;$hitname";
>                 last;
>                 }
>                 elsif    ( $query ne $hitname and $bits <=
> $minimumbitsscore and $evalue >= $thresholdevalue)
>                 {
>                 $cd++;
>                 $diverged { $cd } = "$query;$hitname";
>                 last;
>                 }
> #            while( my $hsp = $hit->next_hsp )    {
> #            #last;
> #            }
>         }
>     }
>
> On Mon, May 16, 2011 at 2:40 PM, Dave Messina <David.Messina at sbc.su.se>wrote:

> Hi Lorenzo,
>
> Please remember to reply all so the mailing list sees the whole discussion.
>
> I looked at your code and used your data, and the reason you're not always
> getting the best hit in your output is that subsequent hits which also pass
> your filter are writing over the best hit in your hash. So it's not a random
> hit that's getting saved, it's the last one that matched your criteria.
>
> To fix this, you need to short-circuit out of the loop once you've found a
> hit matching your criteria.
>
> Label the next_result loop like:
>
>     RESULT: while ( my $result = $in->next_result ) {
>
> and then have the last line inside your filtering if block be
>
>     next RESULT;
>
> Assuming "best hit" to you means best e-value, the above works since hits
> are sorted by best to worst e-value.
>
>
> I'd also like to get the queries only returning itself as hits (singletons)
>> but I must make run besthits properly before (any suggestion?).
>
>
> For this, I'd suggest that you actually save the best hit regardless of
> whether it's a self-match, and then if there's a hit passing your criteria
> that isn't a self-match, you test for that and replace the previously saved
> self-match in %besthits. (Again, short-circuiting out of the loop once
> you've got that hit.)
>
>
> For the moment I'm only testing bitscore and thresholdevalue so I don't
>> need to loop over hsp (do I?).
>
>
> No, remember each line in the -m8 output is an hsp, not a hit. So if you
> only loop over the hits, you'll only see the first hsp. In your example
> dataset, there weren't any with multiple hsps, so that's why you didn't run
> into this problem.
>
>
> Dave
>
>
>
>
> On Sat, May 14, 2011 at 22:17, Lorenzo Carretero <lcpaulet at googlemail.com>wrote:
>
>> Hi David,
>> Thanks for your reply.
>> My scripts is within a subroutine within a .pm file called from a .pl
>> script. I checked the arguments are being passed correctly. For the moment
>> I'm only testing bitscore and thresholdevalue so I don't need to loop over
>> hsp (do I?). Of course, I'm using the strict and warning pragmas and use
>> Bio::SearchIO; I tried with a reduced version of my BLASTP output (attached
>> as a file). This is what I get in %besthits:
>>
>>> gnl|Alyrata|AL2G21220,gnl|Alyrata|AL6G21100
>>> gnl|Alyrata|AL0G11620,gnl|Alyrata|AL1G37580
>>> gnl|Alyrata|AL6G05070,gnl|Alyrata|AL3G15690
>>> gnl|Alyrata|AL2G12090,gnl|Alyrata|AL4G18800
>>> gnl|Alyrata|AL1G15460,gnl|Alyrata|AL0G01870
>>>
>>
>> My script only returns the correct hits for AL6G05070 and AL0G11260, while
>> the 3d for AL2G21220, the 17th for AL2G12090 and the 13th for AL1G15460.
>> Note that the best non-self hit for AL2G12090 is not the second one but the
>> first as it shows better score than the self-hit. I'd also like to get the
>> queries only returning itself as hits (singletons) but I must make run
>> besthits properly before (any suggestion?).
>> Thanks again,
>> Lorenzo
>>
>> On Sat, May 14, 2011 at 6:15 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>>
>>> Hi Lorenzo,
>>>
>>> Hmm, I tried your code, and it worked for me.
>>>
>>> I set your input variables set as follows:
>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>> $minimumidentity,
>>>     $maximumredundancy, $minimumalnlength )
>>>   = ('2008.blasttable', 'blasttable', 300, '1e-100', 10, 0, 10);
>>>
>>> where the file is t/data/2008.blasttable that comes with the BioPerl
>>> distro. With those settings, the top hit goes into %besthits and the third
>>> hit goes into %diverged.
>>>
>>> I also added at the top:
>>> use strict;
>>> use warnings;
>>> use Bio::SearchIO;
>>>
>>> You *are* using strict and warnings, aren't you? :)
>>>
>>> One thing that may be an issue is that you're doing this at the hit
>>> level, and remember that in -m8 format each line represents an HSP.
>>>
>>> I'd need your input — really, a neat little test case and the
>>> corresponding erroneous output — to help further.
>>>
>>> Dave
>>>
>>>
>>>
>>> On Sat, May 14, 2011 at 17:29, Lorenzo Carretero <
>>> lcpaulet at googlemail.com> wrote:
>>>
>>>> Hi all,
>>>> I'm trying to parse blasttable '-m 8 format' reports from whole
>>>> intra-genome
>>>> comparisons of all vs all to get the best non-self hit and diverged best
>>>> non-self hit as separate hashes (key=query=>value=non_self_hit). The
>>>> following script seems to run ok but it returns unexpected results (i.e.
>>>> it
>>>> doesn't catch the best hit but a random hit apparently). I assume hits
>>>> are
>>>> iterated over the while loop (while (my $hit = $result->next_hit) as
>>>> returned in the blast results (i.e. sorted by hits bit scores).
>>>> Any help would be much appreciated.
>>>> Cheers,
>>>> Lorenzo
>>>>
>>>>
>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>> > $minimumidentity, $maximumredundancy, $minimumalnlength ) = @_;
>>>> > my ( %besthits, %diverged, %redundant ,%genefusion ) = ();
>>>> > my ( $refbh, $refdiv, $cb, $cd, $refred, $reffus, $cr, $cf);
>>>> > my $total = 0;
>>>> > my $in = new Bio::SearchIO (     -file   => $filename,
>>>> >                                  -format => $format,
>>>> >                                  #-verbose => -1,
>>>> >                                   )
>>>> >                                  or die "No $filename BLAST file with
>>>> > $format found";
>>>> >       while( my $result = $in->next_result)    {
>>>> >       $total++;
>>>> >           while (my $hit = $result->next_hit)    {
>>>> >           my $query = $result->query_name();
>>>> >           my $hitname = $hit->name();
>>>> >           my $bits = $hit->bits();
>>>> >           my $evalue = $hit->significance();
>>>> >             if ($query ne $hitname and $bits >= $minimumbitsscore and
>>>> > $evalue <= $thresholdevalue )    {
>>>> >                 $besthits{ $query } = $hitname;
>>>> >             }
>>>> >             elsif    ( $bits <= $minimumbitsscore and $evalue >=
>>>> > $thresholdevalue){
>>>> >                 $diverged { $query } = $hitname;
>>>> >             }
>>>> >
>>>> > #                while( my $hsp = $hit->next_hsp )    {
>>>> > #                my $querylen = $hsp->length( 'query' );
>>>> > #                my $hitlen = $hsp->length( 'hit' );
>>>> > #                my $alnlen = $hsp->length( 'total' );
>>>> > #                my $identity = $hsp->percent_identity();
>>>> > #                    if ( $identity > $maximumredundancy )    {
>>>> > #                    $redundant{ $query } = $hitname;
>>>> > #                    }
>>>> > #                    elsif ( ($querylen <= ($alnlen / 2) ) || ($hitlen
>>>> <=
>>>> > ($alnlen / 2) ) || ($alnlen <= $minimumalnlength) )    {
>>>> > #                    $genefusion{ $query } = $hitname;
>>>> > #                    }
>>>> > #                    elsif ( ($evalue >= $thresholdevalue) ||
>>>> ($identity <=
>>>> > $minimumidentity) || ($bits <= $minimumbitsscore) )    {
>>>> > #                    $diverged{ $query } = $hitname;
>>>> > #                    }
>>>> > #                    else    {
>>>> > #                    $besthits{ $query } = $hitname;
>>>> > #                    }
>>>> > #                }
>>>> >             }
>>>> >     }
>>>> > $refbh = \%besthits;
>>>> > $refdiv = \%diverged;
>>>> > $refred = \%redundant;
>>>> > $reffus = \%genefusion;
>>>> > $cb = keys ( %besthits );
>>>> > $cd = keys( %diverged );
>>>> > $cr = keys( %redundant );
>>>> > $cf = keys( %genefusion );
>>>> >
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>
>




More information about the Bioperl-l mailing list