[Bioperl-l] Odd issue with HSP::seq_inds()

Fri Feb 1 19:03:36 UTC 2008

Apologies ahead of time for the longish post!

While working on a recent bug (http://bugzilla.open-bio.org/show_bug.cgi?id=2436 
) I found a pretty significant bug.  It appears that seq_inds() is  
giving some bad return values tied specifically to reports where  
either or both query or hit are translated.  I'll file a full bug  
report but I felt it might be worth bringing up here to discuss  
specifically what values should be expected when these reports are  
parsed (all three nucleotide positions or only the first for each  
nonmatching translated position, etc).

I used the following script to track the total HSPs as well as those  
where seq_inds() returns positions within the HSP bounds for query and  
hit:

-------------------------------------------

use Bio::SearchIO;
use Test::More qw(no_plan);

my $file = shift || die;

my $io = Bio::SearchIO->new(-format => 'blast',
                             -file => $file);

my ($hsp_co, $correctq, $correcth) = (0,0,0);

while (my $res = $io->next_result) {
     while (my $hit = $res->next_hit) {
         while (my $hsp = $hit->next_hsp) {
             $hsp_co++;
             my ($qs, $qe, $hs, $he) = ($hsp->start('query'),$hsp- 
 >end('query'),
                                        $hsp->start('hit'),$hsp- 
 >end('hit'));
             my @pos = $hsp->seq_inds('query','nomatch');
             my @pos2 = grep {$_ >= $qs && $_ <= $qe} @pos;

             # should always be true
             if (scalar(@pos2) == scalar(@pos)) {
                 $correctq++;
             }

             @pos = $hsp->seq_inds('hit','nomatch');
             @pos2 = grep {$_ >= $hs && $_ <= $he} @pos;

             # should always be true
             if (scalar(@pos2) == scalar(@pos)) {
                 $correcth++;
             }
         }
     }
}

is($correctq, $hsp_co, 'Query seq_inds() positions fall within HSP  
bounds');
is($correcth, $hsp_co, 'Subject seq_inds() positions fall within HSP  
bounds');

-------------------------------------------

Here are the results for the following BLAST report types (culled from  
test reports used in SearchIO.t):

BLASTN : both pass
TBLASTN : query passes, subject fails
BLASTX : query fails, subject passes
TBLASTX : both fail

When looking at the failed query seq_inds() data for the first HSP in  
the TBLASTX report used in SearchIO.t (HUMBETGLOA.tblastx), I get  
these positions for the query:

355 364 365 367 368 370 371 373 374 375

and for the hit:

1941 1942 1943 1945 1946 1948 1949 1951 1952 1961

This is the actual first HSP for that report:

-------------------------------------------

 >gb|AE000479.1|AE000479 Escherichia coli K-12 MG1655 section 369 of  
400 of the complete
             genome
           Length = 10934

  Score = 33.6 bits (67), Expect = 0.13
  Identities = 11/26 (42%), Positives = 16/26 (61%)
  Frame = +1 / -2

Query: 1057 SAYWSIFPPLGCWWSTLGPRGSLSPL 1134
             +A W++FPP+G  W  L  +   SPL
Sbjct: 5893 AAVWALFPPVGSQWGCLASQWRTSPL 5816

-------------------------------------------

It looks like it has something to do with the positions getting  
divided by 3 but not converted back to correspond to the nucleotide  
position.  When fixing this (and as mentioned above), do we want to  
only report the first nucleotide position where the mismatch occurs,  
or report the three positions corresponding to that codon region?

Any thoughts?

chris