[Bioperl-l] Odd issue with HSP::seq_inds()
Chris Fields
cjfields at uiuc.edu
Fri Feb 1 19:03:36 UTC 2008
Apologies ahead of time for the longish post!
While working on a recent bug (http://bugzilla.open-bio.org/show_bug.cgi?id=2436
) I found a pretty significant bug. It appears that seq_inds() is
giving some bad return values tied specifically to reports where
either or both query or hit are translated. I'll file a full bug
report but I felt it might be worth bringing up here to discuss
specifically what values should be expected when these reports are
parsed (all three nucleotide positions or only the first for each
nonmatching translated position, etc).
I used the following script to track the total HSPs as well as those
where seq_inds() returns positions within the HSP bounds for query and
hit:
-------------------------------------------
use Bio::SearchIO;
use Test::More qw(no_plan);
my $file = shift || die;
my $io = Bio::SearchIO->new(-format => 'blast',
-file => $file);
my ($hsp_co, $correctq, $correcth) = (0,0,0);
while (my $res = $io->next_result) {
while (my $hit = $res->next_hit) {
while (my $hsp = $hit->next_hsp) {
$hsp_co++;
my ($qs, $qe, $hs, $he) = ($hsp->start('query'),$hsp-
>end('query'),
$hsp->start('hit'),$hsp-
>end('hit'));
my @pos = $hsp->seq_inds('query','nomatch');
my @pos2 = grep {$_ >= $qs && $_ <= $qe} @pos;
# should always be true
if (scalar(@pos2) == scalar(@pos)) {
$correctq++;
}
@pos = $hsp->seq_inds('hit','nomatch');
@pos2 = grep {$_ >= $hs && $_ <= $he} @pos;
# should always be true
if (scalar(@pos2) == scalar(@pos)) {
$correcth++;
}
}
}
}
is($correctq, $hsp_co, 'Query seq_inds() positions fall within HSP
bounds');
is($correcth, $hsp_co, 'Subject seq_inds() positions fall within HSP
bounds');
-------------------------------------------
Here are the results for the following BLAST report types (culled from
test reports used in SearchIO.t):
BLASTN : both pass
TBLASTN : query passes, subject fails
BLASTX : query fails, subject passes
TBLASTX : both fail
When looking at the failed query seq_inds() data for the first HSP in
the TBLASTX report used in SearchIO.t (HUMBETGLOA.tblastx), I get
these positions for the query:
355 364 365 367 368 370 371 373 374 375
and for the hit:
1941 1942 1943 1945 1946 1948 1949 1951 1952 1961
This is the actual first HSP for that report:
-------------------------------------------
>gb|AE000479.1|AE000479 Escherichia coli K-12 MG1655 section 369 of
400 of the complete
genome
Length = 10934
Score = 33.6 bits (67), Expect = 0.13
Identities = 11/26 (42%), Positives = 16/26 (61%)
Frame = +1 / -2
Query: 1057 SAYWSIFPPLGCWWSTLGPRGSLSPL 1134
+A W++FPP+G W L + SPL
Sbjct: 5893 AAVWALFPPVGSQWGCLASQWRTSPL 5816
-------------------------------------------
It looks like it has something to do with the positions getting
divided by 3 but not converted back to correspond to the nucleotide
position. When fixing this (and as mentioned above), do we want to
only report the first nucleotide position where the mismatch occurs,
or report the three positions corresponding to that codon region?
Any thoughts?
chris
More information about the Bioperl-l
mailing list