[Bioperl-l] Re: bug #949
Jason Stajich
jason@chg.mc.duke.edu
Thu, 14 Jun 2001 10:20:51 -0400 (EDT)
On Thu, 14 Jun 2001, Roger Hall wrote:
> I am not sure what the bug is - do I need to
> 1. leave the test in place comparing frame sign [+-] against the subject
> strand directional sign and change the warning to say 'subject' instead of
> 'query', or
> 2. change the test to compare against the query strand directional sign?
>
The current behavior was that the frame sign (+/-) was dependent on the
query strand. I don't think it is.
Let's walk through a blastx and tblastn report and make sure we agree on
everything
For this specific blastx HSP:
>gi|728837|sp|P39194|ALU7_HUMAN ALU SUBFAMILY SQ SEQUENCE
CONTAMINATION WARNING ENTRY
Length = 593
Score = 161 bits (404), Expect = 8e-39
Identities = 73/97 (75%), Positives = 76/97 (78%)
Frame = -1
Query: 2996 FFLRQSFTLVTQAGVQWHDLSSLQPLPPRFKGFSSLSLPISWDYRRLPPCLANFCIFHKD 2817
FFLR+SF LV QAGVQW DL SLQP PP FK FS LSLP SWDYRR PP ANFCIF +D
Sbjct: 299 FFLRRSFALVAQAGVQWRDLGSLQPPPPGFKRFSCLSLPSSWDYRRPPPRPANFCIFSRD 358
Query: 2816 GVLPCWPGWS*TPDLR*YAHFGIPKCWDYRREPPCPA 2706
GV PCWPGWS TPDLR G+PKCWDYRREPP PA
Sbjct: 359 GVSPCWPGWSRTPDLRXSTRLGLPKCWDYRREPPRPA 395
query start = 2706
query end = 2996
query strand= -1
sbjct start = 299
sbjct end = 395
sbjct strand= 1
Frame = -1
But for the next HSP of the same Sbjct
Score = 156 bits (391), Expect = 3e-37
Identities = 76/97 (78%), Positives = 83/97 (85%)
Frame = +3
Query: 2706 GRARWLTPVIPALWDAKVGISSEVRSSRPAWPTWQNSVFMKNTKISQAWWQAPVIPANWE 2885
GRARWLTPVIPALW+A+ G S EVRSSRPAWPTW N V KNTKIS+AWW+APVIPA E
Sbjct: 1 GRARWLTPVIPALWEAEAGGSPEVRSSRPAWPTWXNPVSTKNTKISRAWWRAPVIPATRE 60
Query: 2886 AEAGESLESRRQRLQ*AEIVPLHSSLGDKSKTLSQKK 2996
AEAGESLE R+RLQ AEI PLHSSLG+KS+T SQKK
Sbjct: 61 AEAGESLEPGRRRLQXAEIAPLHSSLGNKSETPSQKK 97
query start = 2706
query end = 2996
query strand= -1
sbjct start = 1
sbjct end = 97
sbjct strand= 1
Frame = +3
> I have verified that the frame and start/end pair are parsed correctly, and
> that the start/end pair is correctly used to determine the strand direction.
>
> I am puzzled by the line:
> $frame = $2 - 1;
> which comes after the match:
> $frame !~ /^([+-])?([1-3])/
> This reduces the frame domain to [0-2] for every frame assigned. (and I
> don't understand why - string index? :).
frames in blast are 1-3, in GFF they are 0-2, we follow GFF Bioperl.
>
> Even though the warning is made to STDERR, I had no trouble calling various
> methods (score, bits, percent, etc.) on my HSP objects (including verifying
> that the frame domain was now [0-2] - did I mention my puzzlement?). Perhaps
> the 'Devil of a time' is the warnings...?
I believe that is true.
> I compared the test data (sample.blx as provided by the issue originator) to
> their respective results by hand (about a dozen Sbjcts and 150 HSPs), and it
> was obvious that the subject strand sign is properly the opposite of the
> frame sign when the warnings are made.
> It was equally obvious that the
> *query* strand sign was *equal* to the frame sign.
>
Not sure this is true if you look at the second HSP for the first
Sbjct hit in the report. I *think* this should be true for tblastn
reports however which is why that assertion/warning was written into HSP.
> As I said - I'm not sure what the problem is - although I am very familiar
> with BPLite now. :}
>
> aka: Whaddaya want it/me to do?
>
We should probably disable the test/warnings in BPlite::HSP unless we
want to autodetect blastx reports and just supress warnings on those
type of reports.
I believe I am interpreting everything correctly here.
I've committed a shortened version of the submitted blastx report as well as
added the test to BPlite.t so we can be sure and see these msgs.
> Thanks (and sorry for being so green)!
>
> Roger
>
>
> Code Details:
> The frame value is properly parsed in Sbjct::nextHSP(), although it does not
> enforce a domain of [1-3], but rather allows any numerical value.
>
> elsif( $_ =~ /^\s*Frame\s*=\s*([-\+]\d+)/ ) {
> $frame = $1;
> }
>
> The query, homology, and subject lines are parsed, and the start and end
> positions are gathered (subject line match shown).
>
> for(my $i=0;$i<@hspline;$i+=3) {
> $hspline[$i+2] =~ /^Sbjct:\s+(\d+)\s*([\D\S]+)\s+(\d+)/;
> $sl = $2; $sb = $1 unless $sb; $se = $3;
> push @SL, $sl;
> }
> $sl = join("", @SL);
>
> Similar code is used for the query and homology lines.
>
> Once parsed, the new HSP object is created using the frame, start/end
> positions, etc.
>
> After rearranging args, HSP::new() determines the start/end pair vector for
> the strand param to a new Similarity object.
>
> if ($se > $sb) { # normal subject
> $self->subject( Bio::SeqFeature::Similarity->new
> (-start=>$sb, -end=>$se, -strand=>1,
> -source=>"BLAST" ) ) }
> else { # reverse subject: start bigger than end
> $self->subject( Bio::SeqFeature::Similarity->new
> (-start=>$se, -end=>$sb, -strand=>-1,
> -source=>"BLAST" ) ) }
>
> Late in new(), frame() is called, where the normal frame domain is enforced.
> When the param to frame() is defined and matches /^([+-])?([1-3])/, the
> frame value is compared with the subject strand. If the signs are unequal,
> the 'did not match' warning is produced.
>
> Then the absolute value of frame is reduced by one. (?)
>
>
>
> ----- Original Message -----
> From: "Jason Stajich" <jason@chg.mc.duke.edu>
> To: "Roger Hall" <roger@iosea.com>
> Cc: "Hilmar Lapp" <hilmarl@yahoo.com>; "Ewan Birney" <birney@ebi.ac.uk>
> Sent: Wednesday, June 13, 2001 1:18 PM
> Subject: bug #949
>
>
> > I'm starting to look at bug 949 although I'm not particularly excited
> > about it...
> >
> > Let me know if you
> > a) claim it
> > b) have ideas
> >
>
>
Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center
http://www.chg.duke.edu/