[Bioperl-l] Problems parsing blast reports
Chris Fields
cjfields at illinois.edu
Mon May 16 20:58:10 UTC 2011
Agreed, though I would rather screen based on the actual statistic I wanted than assume the first one will always give the best of anything (I'm a bit of a devil's advocate in that regard). Following is some completely untested and probably incorrect code (batteries not included, offer void in all 50 states), but you get the idea:
=================================
use List::Utils qw(reduce);
...
while( my $result = $in->next_result) {
my $best_hit = reduce {$a->significance < $b->significance ? $a : $b } $result->hits;
if (defined($best_hit)) {
my $best_hsp = reduce {$a->evalue < $b->evalue ? $a : $b } $best_hit->hsps;
# do earth-shattering stuff here
}
}
=================================
chris
On May 16, 2011, at 3:14 PM, Dave Messina wrote:
> Hi Lorenzo,
>
>
> Hsps for a given hit are also ordered by score so is it really necessary to
>> loop?.
>>
>
> If you know you'll always be interested in the highest-scoring hsp for a
> hit, and if you're not using anything in the hsp object. then yes, it should
> always come first and I don't think you'll need the hsp loop.
>
> Glad to hear you've got it working!
>
> Best,
> Dave
>
>
>
>
>
>
>> Best,
>> Lorenzo
>> PS: Sorry, I forgot to reply all in my last message
>>
>> while( my $result = $in->next_result)
>>> {
>>> $total++;
>>> my $query = $result->query_name();
>>> while (my $hit = $result->next_hit)
>>> {
>>> my $hitname = $hit->name();
>>> my $bits = $hit->bits();
>>> my $evalue = $hit->significance();
>>> if ( $result->num_hits == 1 and $query eq $hitname )
>>> {
>>> $cs++;
>>> $singletons{ $cs } = "$query;$hitname";
>>> last;
>>>
>>> }
>>> if ( $query ne $hitname and $bits >= $minimumbitsscore and
>>> $evalue <= $thresholdevalue )
>>> {
>>> $cb++;
>>> $besthits{ $cb } = "$query;$hitname";
>>> last;
>>> }
>>> elsif ( $query ne $hitname and $bits <=
>>> $minimumbitsscore and $evalue >= $thresholdevalue)
>>> {
>>> $cd++;
>>> $diverged { $cd } = "$query;$hitname";
>>> last;
>>> }
>>> # while( my $hsp = $hit->next_hsp ) {
>>> # #last;
>>> # }
>>> }
>>> }
>>>
>>> On Mon, May 16, 2011 at 2:40 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>>
>>> Hi Lorenzo,
>>>
>>> Please remember to reply all so the mailing list sees the whole
>>> discussion.
>>>
>>> I looked at your code and used your data, and the reason you're not always
>>> getting the best hit in your output is that subsequent hits which also pass
>>> your filter are writing over the best hit in your hash. So it's not a random
>>> hit that's getting saved, it's the last one that matched your criteria.
>>>
>>> To fix this, you need to short-circuit out of the loop once you've found a
>>> hit matching your criteria.
>>>
>>> Label the next_result loop like:
>>>
>>> RESULT: while ( my $result = $in->next_result ) {
>>>
>>> and then have the last line inside your filtering if block be
>>>
>>> next RESULT;
>>>
>>> Assuming "best hit" to you means best e-value, the above works since hits
>>> are sorted by best to worst e-value.
>>>
>>>
>>> I'd also like to get the queries only returning itself as hits
>>>> (singletons) but I must make run besthits properly before (any suggestion?).
>>>
>>>
>>> For this, I'd suggest that you actually save the best hit regardless of
>>> whether it's a self-match, and then if there's a hit passing your criteria
>>> that isn't a self-match, you test for that and replace the previously saved
>>> self-match in %besthits. (Again, short-circuiting out of the loop once
>>> you've got that hit.)
>>>
>>>
>>> For the moment I'm only testing bitscore and thresholdevalue so I don't
>>>> need to loop over hsp (do I?).
>>>
>>>
>>> No, remember each line in the -m8 output is an hsp, not a hit. So if you
>>> only loop over the hits, you'll only see the first hsp. In your example
>>> dataset, there weren't any with multiple hsps, so that's why you didn't run
>>> into this problem.
>>>
>>>
>>> Dave
>>>
>>>
>>>
>>>
>>> On Sat, May 14, 2011 at 22:17, Lorenzo Carretero <lcpaulet at googlemail.com
>>>> wrote:
>>>
>>>> Hi David,
>>>> Thanks for your reply.
>>>> My scripts is within a subroutine within a .pm file called from a .pl
>>>> script. I checked the arguments are being passed correctly. For the moment
>>>> I'm only testing bitscore and thresholdevalue so I don't need to loop over
>>>> hsp (do I?). Of course, I'm using the strict and warning pragmas and use
>>>> Bio::SearchIO; I tried with a reduced version of my BLASTP output (attached
>>>> as a file). This is what I get in %besthits:
>>>>
>>>>> gnl|Alyrata|AL2G21220,gnl|Alyrata|AL6G21100
>>>>> gnl|Alyrata|AL0G11620,gnl|Alyrata|AL1G37580
>>>>> gnl|Alyrata|AL6G05070,gnl|Alyrata|AL3G15690
>>>>> gnl|Alyrata|AL2G12090,gnl|Alyrata|AL4G18800
>>>>> gnl|Alyrata|AL1G15460,gnl|Alyrata|AL0G01870
>>>>>
>>>>
>>>> My script only returns the correct hits for AL6G05070 and AL0G11260,
>>>> while the 3d for AL2G21220, the 17th for AL2G12090 and the 13th for
>>>> AL1G15460. Note that the best non-self hit for AL2G12090 is not the second
>>>> one but the first as it shows better score than the self-hit. I'd also like
>>>> to get the queries only returning itself as hits (singletons) but I must
>>>> make run besthits properly before (any suggestion?).
>>>> Thanks again,
>>>> Lorenzo
>>>>
>>>> On Sat, May 14, 2011 at 6:15 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>>>>
>>>>> Hi Lorenzo,
>>>>>
>>>>> Hmm, I tried your code, and it worked for me.
>>>>>
>>>>> I set your input variables set as follows:
>>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>>> $minimumidentity,
>>>>> $maximumredundancy, $minimumalnlength )
>>>>> = ('2008.blasttable', 'blasttable', 300, '1e-100', 10, 0, 10);
>>>>>
>>>>> where the file is t/data/2008.blasttable that comes with the BioPerl
>>>>> distro. With those settings, the top hit goes into %besthits and the third
>>>>> hit goes into %diverged.
>>>>>
>>>>> I also added at the top:
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::SearchIO;
>>>>>
>>>>> You *are* using strict and warnings, aren't you? :)
>>>>>
>>>>> One thing that may be an issue is that you're doing this at the hit
>>>>> level, and remember that in -m8 format each line represents an HSP.
>>>>>
>>>>> I'd need your input — really, a neat little test case and the
>>>>> corresponding erroneous output — to help further.
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 14, 2011 at 17:29, Lorenzo Carretero <
>>>>> lcpaulet at googlemail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I'm trying to parse blasttable '-m 8 format' reports from whole
>>>>>> intra-genome
>>>>>> comparisons of all vs all to get the best non-self hit and diverged
>>>>>> best
>>>>>> non-self hit as separate hashes (key=query=>value=non_self_hit). The
>>>>>> following script seems to run ok but it returns unexpected results
>>>>>> (i.e. it
>>>>>> doesn't catch the best hit but a random hit apparently). I assume hits
>>>>>> are
>>>>>> iterated over the while loop (while (my $hit = $result->next_hit) as
>>>>>> returned in the blast results (i.e. sorted by hits bit scores).
>>>>>> Any help would be much appreciated.
>>>>>> Cheers,
>>>>>> Lorenzo
>>>>>>
>>>>>>
>>>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>>>>> $minimumidentity, $maximumredundancy, $minimumalnlength ) = @_;
>>>>>>> my ( %besthits, %diverged, %redundant ,%genefusion ) = ();
>>>>>>> my ( $refbh, $refdiv, $cb, $cd, $refred, $reffus, $cr, $cf);
>>>>>>> my $total = 0;
>>>>>>> my $in = new Bio::SearchIO ( -file => $filename,
>>>>>>> -format => $format,
>>>>>>> #-verbose => -1,
>>>>>>> )
>>>>>>> or die "No $filename BLAST file with
>>>>>>> $format found";
>>>>>>> while( my $result = $in->next_result) {
>>>>>>> $total++;
>>>>>>> while (my $hit = $result->next_hit) {
>>>>>>> my $query = $result->query_name();
>>>>>>> my $hitname = $hit->name();
>>>>>>> my $bits = $hit->bits();
>>>>>>> my $evalue = $hit->significance();
>>>>>>> if ($query ne $hitname and $bits >= $minimumbitsscore and
>>>>>>> $evalue <= $thresholdevalue ) {
>>>>>>> $besthits{ $query } = $hitname;
>>>>>>> }
>>>>>>> elsif ( $bits <= $minimumbitsscore and $evalue >=
>>>>>>> $thresholdevalue){
>>>>>>> $diverged { $query } = $hitname;
>>>>>>> }
>>>>>>>
>>>>>>> # while( my $hsp = $hit->next_hsp ) {
>>>>>>> # my $querylen = $hsp->length( 'query' );
>>>>>>> # my $hitlen = $hsp->length( 'hit' );
>>>>>>> # my $alnlen = $hsp->length( 'total' );
>>>>>>> # my $identity = $hsp->percent_identity();
>>>>>>> # if ( $identity > $maximumredundancy ) {
>>>>>>> # $redundant{ $query } = $hitname;
>>>>>>> # }
>>>>>>> # elsif ( ($querylen <= ($alnlen / 2) ) ||
>>>>>> ($hitlen <=
>>>>>>> ($alnlen / 2) ) || ($alnlen <= $minimumalnlength) ) {
>>>>>>> # $genefusion{ $query } = $hitname;
>>>>>>> # }
>>>>>>> # elsif ( ($evalue >= $thresholdevalue) ||
>>>>>> ($identity <=
>>>>>>> $minimumidentity) || ($bits <= $minimumbitsscore) ) {
>>>>>>> # $diverged{ $query } = $hitname;
>>>>>>> # }
>>>>>>> # else {
>>>>>>> # $besthits{ $query } = $hitname;
>>>>>>> # }
>>>>>>> # }
>>>>>>> }
>>>>>>> }
>>>>>>> $refbh = \%besthits;
>>>>>>> $refdiv = \%diverged;
>>>>>>> $refred = \%redundant;
>>>>>>> $reffus = \%genefusion;
>>>>>>> $cb = keys ( %besthits );
>>>>>>> $cd = keys( %diverged );
>>>>>>> $cr = keys( %redundant );
>>>>>>> $cf = keys( %genefusion );
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list