[Bioperl-l] Problems parsing blast reports

Dave Messina David.Messina at sbc.su.se
Mon May 16 20:14:04 UTC 2011


Hi Lorenzo,


 Hsps for a given hit are also ordered by score so is it really necessary to
> loop?.
>

If you know you'll always be interested in the highest-scoring hsp for a
hit, and if you're not using anything in the hsp object. then yes, it should
always come first and I don't think you'll need the hsp loop.

Glad to hear you've got it working!

Best,
Dave






> Best,
> Lorenzo
> PS: Sorry, I forgot to reply all in my last message
>
> while( my $result = $in->next_result)
>>       {
>>       $total++;
>>       my $query = $result->query_name();
>>           while (my $hit = $result->next_hit)
>>           {
>>               my $hitname = $hit->name();
>>             my $bits = $hit->bits();
>>             my $evalue = $hit->significance();
>>                 if ( $result->num_hits == 1 and $query eq $hitname )
>>                 {
>>                 $cs++;
>>                 $singletons{ $cs } = "$query;$hitname";
>>                 last;
>>
>>                 }
>>                 if ( $query ne $hitname and $bits >= $minimumbitsscore and
>> $evalue <= $thresholdevalue )
>>                 {
>>                 $cb++;
>>                 $besthits{ $cb } = "$query;$hitname";
>>                 last;
>>                 }
>>                 elsif    ( $query ne $hitname and $bits <=
>> $minimumbitsscore and $evalue >= $thresholdevalue)
>>                 {
>>                 $cd++;
>>                 $diverged { $cd } = "$query;$hitname";
>>                 last;
>>                 }
>> #            while( my $hsp = $hit->next_hsp )    {
>> #            #last;
>> #            }
>>         }
>>     }
>>
>> On Mon, May 16, 2011 at 2:40 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>
>> Hi Lorenzo,
>>
>> Please remember to reply all so the mailing list sees the whole
>> discussion.
>>
>> I looked at your code and used your data, and the reason you're not always
>> getting the best hit in your output is that subsequent hits which also pass
>> your filter are writing over the best hit in your hash. So it's not a random
>> hit that's getting saved, it's the last one that matched your criteria.
>>
>> To fix this, you need to short-circuit out of the loop once you've found a
>> hit matching your criteria.
>>
>> Label the next_result loop like:
>>
>>     RESULT: while ( my $result = $in->next_result ) {
>>
>> and then have the last line inside your filtering if block be
>>
>>     next RESULT;
>>
>> Assuming "best hit" to you means best e-value, the above works since hits
>> are sorted by best to worst e-value.
>>
>>
>> I'd also like to get the queries only returning itself as hits
>>> (singletons) but I must make run besthits properly before (any suggestion?).
>>
>>
>> For this, I'd suggest that you actually save the best hit regardless of
>> whether it's a self-match, and then if there's a hit passing your criteria
>> that isn't a self-match, you test for that and replace the previously saved
>> self-match in %besthits. (Again, short-circuiting out of the loop once
>> you've got that hit.)
>>
>>
>> For the moment I'm only testing bitscore and thresholdevalue so I don't
>>> need to loop over hsp (do I?).
>>
>>
>> No, remember each line in the -m8 output is an hsp, not a hit. So if you
>> only loop over the hits, you'll only see the first hsp. In your example
>> dataset, there weren't any with multiple hsps, so that's why you didn't run
>> into this problem.
>>
>>
>> Dave
>>
>>
>>
>>
>> On Sat, May 14, 2011 at 22:17, Lorenzo Carretero <lcpaulet at googlemail.com
>> > wrote:
>>
>>> Hi David,
>>> Thanks for your reply.
>>> My scripts is within a subroutine within a .pm file called from a .pl
>>> script. I checked the arguments are being passed correctly. For the moment
>>> I'm only testing bitscore and thresholdevalue so I don't need to loop over
>>> hsp (do I?). Of course, I'm using the strict and warning pragmas and use
>>> Bio::SearchIO; I tried with a reduced version of my BLASTP output (attached
>>> as a file). This is what I get in %besthits:
>>>
>>>> gnl|Alyrata|AL2G21220,gnl|Alyrata|AL6G21100
>>>> gnl|Alyrata|AL0G11620,gnl|Alyrata|AL1G37580
>>>> gnl|Alyrata|AL6G05070,gnl|Alyrata|AL3G15690
>>>> gnl|Alyrata|AL2G12090,gnl|Alyrata|AL4G18800
>>>> gnl|Alyrata|AL1G15460,gnl|Alyrata|AL0G01870
>>>>
>>>
>>> My script only returns the correct hits for AL6G05070 and AL0G11260,
>>> while the 3d for AL2G21220, the 17th for AL2G12090 and the 13th for
>>> AL1G15460. Note that the best non-self hit for AL2G12090 is not the second
>>> one but the first as it shows better score than the self-hit. I'd also like
>>> to get the queries only returning itself as hits (singletons) but I must
>>> make run besthits properly before (any suggestion?).
>>> Thanks again,
>>> Lorenzo
>>>
>>> On Sat, May 14, 2011 at 6:15 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>>>
>>>> Hi Lorenzo,
>>>>
>>>> Hmm, I tried your code, and it worked for me.
>>>>
>>>> I set your input variables set as follows:
>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>> $minimumidentity,
>>>>     $maximumredundancy, $minimumalnlength )
>>>>   = ('2008.blasttable', 'blasttable', 300, '1e-100', 10, 0, 10);
>>>>
>>>> where the file is t/data/2008.blasttable that comes with the BioPerl
>>>> distro. With those settings, the top hit goes into %besthits and the third
>>>> hit goes into %diverged.
>>>>
>>>> I also added at the top:
>>>> use strict;
>>>> use warnings;
>>>> use Bio::SearchIO;
>>>>
>>>> You *are* using strict and warnings, aren't you? :)
>>>>
>>>> One thing that may be an issue is that you're doing this at the hit
>>>> level, and remember that in -m8 format each line represents an HSP.
>>>>
>>>> I'd need your input — really, a neat little test case and the
>>>> corresponding erroneous output — to help further.
>>>>
>>>> Dave
>>>>
>>>>
>>>>
>>>> On Sat, May 14, 2011 at 17:29, Lorenzo Carretero <
>>>> lcpaulet at googlemail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm trying to parse blasttable '-m 8 format' reports from whole
>>>>> intra-genome
>>>>> comparisons of all vs all to get the best non-self hit and diverged
>>>>> best
>>>>> non-self hit as separate hashes (key=query=>value=non_self_hit). The
>>>>> following script seems to run ok but it returns unexpected results
>>>>> (i.e. it
>>>>> doesn't catch the best hit but a random hit apparently). I assume hits
>>>>> are
>>>>> iterated over the while loop (while (my $hit = $result->next_hit) as
>>>>> returned in the blast results (i.e. sorted by hits bit scores).
>>>>> Any help would be much appreciated.
>>>>> Cheers,
>>>>> Lorenzo
>>>>>
>>>>>
>>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>>> > $minimumidentity, $maximumredundancy, $minimumalnlength ) = @_;
>>>>> > my ( %besthits, %diverged, %redundant ,%genefusion ) = ();
>>>>> > my ( $refbh, $refdiv, $cb, $cd, $refred, $reffus, $cr, $cf);
>>>>> > my $total = 0;
>>>>> > my $in = new Bio::SearchIO (     -file   => $filename,
>>>>> >                                  -format => $format,
>>>>> >                                  #-verbose => -1,
>>>>> >                                   )
>>>>> >                                  or die "No $filename BLAST file with
>>>>> > $format found";
>>>>> >       while( my $result = $in->next_result)    {
>>>>> >       $total++;
>>>>> >           while (my $hit = $result->next_hit)    {
>>>>> >           my $query = $result->query_name();
>>>>> >           my $hitname = $hit->name();
>>>>> >           my $bits = $hit->bits();
>>>>> >           my $evalue = $hit->significance();
>>>>> >             if ($query ne $hitname and $bits >= $minimumbitsscore and
>>>>> > $evalue <= $thresholdevalue )    {
>>>>> >                 $besthits{ $query } = $hitname;
>>>>> >             }
>>>>> >             elsif    ( $bits <= $minimumbitsscore and $evalue >=
>>>>> > $thresholdevalue){
>>>>> >                 $diverged { $query } = $hitname;
>>>>> >             }
>>>>> >
>>>>> > #                while( my $hsp = $hit->next_hsp )    {
>>>>> > #                my $querylen = $hsp->length( 'query' );
>>>>> > #                my $hitlen = $hsp->length( 'hit' );
>>>>> > #                my $alnlen = $hsp->length( 'total' );
>>>>> > #                my $identity = $hsp->percent_identity();
>>>>> > #                    if ( $identity > $maximumredundancy )    {
>>>>> > #                    $redundant{ $query } = $hitname;
>>>>> > #                    }
>>>>> > #                    elsif ( ($querylen <= ($alnlen / 2) ) ||
>>>>> ($hitlen <=
>>>>> > ($alnlen / 2) ) || ($alnlen <= $minimumalnlength) )    {
>>>>> > #                    $genefusion{ $query } = $hitname;
>>>>> > #                    }
>>>>> > #                    elsif ( ($evalue >= $thresholdevalue) ||
>>>>> ($identity <=
>>>>> > $minimumidentity) || ($bits <= $minimumbitsscore) )    {
>>>>> > #                    $diverged{ $query } = $hitname;
>>>>> > #                    }
>>>>> > #                    else    {
>>>>> > #                    $besthits{ $query } = $hitname;
>>>>> > #                    }
>>>>> > #                }
>>>>> >             }
>>>>> >     }
>>>>> > $refbh = \%besthits;
>>>>> > $refdiv = \%diverged;
>>>>> > $refred = \%redundant;
>>>>> > $reffus = \%genefusion;
>>>>> > $cb = keys ( %besthits );
>>>>> > $cd = keys( %diverged );
>>>>> > $cr = keys( %redundant );
>>>>> > $cf = keys( %genefusion );
>>>>> >
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>>
>>>
>>
>




More information about the Bioperl-l mailing list