[Bioperl-l] Blast stuff
Steve Chervitz
sac at bioperl.org
Mon Jan 27 14:20:31 EST 2003
Good catch, guys. I would also expect seq_inds('conserved') to include
identical matches, since identity definitely means conserved. It should
also jibe with frac_conserved and num_conserved. So I'd call this a bug
and fix seq_inds('conserved') to do what everyone expects.
As for returning the conserved but non-identical matches, I'd recommend
defining an unambiguous argument that won't mislead anyone, such as
seq_inds("conserved-but-not-identical"). Wordy, but explicit. It
probably won't be of concern to the typical user since it doesn't
correspond to any BLAST statistic.
As for backward compatibility, I consider this more of a bug than a
feature, so a change in behavior is warranted. We can of course note it
in the release notes, and even issue a warning when
seq_inds("conserved") is called. The warning would be in the 1.2.1 and
1.2.2 release only.
Steve
On Monday, Jan 27, 2003, at 05:00 US/Pacific, Jason Stajich wrote:
> Ben - That's a fine idea. Definitely something that should be
> applied, I
> would have expected the same thing.
>
> Will look into it today should be trivial to implement.
>
> -jason
>
> On Sun, 26 Jan 2003, Benjamin Berman wrote:
>
>>
>> Hey Jason,
>>
>> For some reason, the bioperl list seems to be rejecting posts I make
>> from
>> this computer, so could you forward this on to steve and the rest of
>> the list.
>>
>> I found another problem I believe. In both the blast and psiblast
>> versions, Hit/HSP::seq_inds('conserved') seems to return the
>> similar/conserved bases alone, not including the identical ones.
>> Common
>> intuition, as well as BLAST itself, would consider conserved (or
>> positive)
>> matches to include both the conserved and identical residues. While
>> the
>> current semantics may be correct under a certain logic, I think it's
>> very
>> misleading- especially because they contradict what is returned by
>> frac_conserved and num_conserved. I spent a long time today tracking
>> down
>> unexpected results I was getting, and it turned out to be because I
>> was
>> assuming that frac_conserved and seq_inds('conserved') were returning
>> the
>> same thing.
>>
>> I think it's useful to provide access to the conserved indices apart
>> from
>> the identical ones. And I guess for backwards compatibility
>> 'conserved'
>> should continue to return what it does now (although I think it would
>> prevent future problems to alter the semantics now). But the
>> documentation
>> should make it abundantly clear in this case what's going on. In
>> addtion,
>> another class such as 'positives', could be added to return the
>> conserved+identical union.
>>
>> thanks for the consideration,
>> ben.
>>
>>
>>
>> At 05:34 PM 1/26/2003 -0500, Jason Stajich wrote:
>>> Ben -
>>>
>>> The reason stems from this: the last HSP is not in the same Frame.
>>> So the
>>> tile_hsps puts it into a separate group - so the last 18 aa are not
>>> included in the longest contiguous alignable region which is used to
>>> calulate that frac_identical for the overall hit. This is in lines
>>> 215-226 in Bio::Search::SearchUtils.
>>>
>>> Steve may have comments on what is the appropriate behavior here.
>>>
>>> I'm commiting some code cleanup made during my little audit but it
>>> should
>>> not functionally change anything.
>>>
>>> Oops - actually one change - I think the logical_length of the hit
>>> should be the length / 3 for a TBLASTN so that put the percent id at
>>> 84%
>>> which is consistent with what Steve was doing in psiblast objects.
>>> Comments welcome there as to what is the correct thing to return
>>> there.
>>> Perhaps the user should be able to request either?
>>>
>>> I'm getting an error with psiblast parser right now because a frame
>>> = '+1'
>>> is flipping out the SeqFeature::Generic obj. Something to look at I
>>> guess.
>>>
>>> Superbowl will have to interrupt further debugging, sorry...
>>>
>>> [snip]
>>>
>>> Score = 52 (220.6 bits), Expect = 0.0, Sum P(4) = 0.0, Group = 1
>>> Identities = 52/52 (100%), Positives = 52/52 (100%), Frame = +1
>>>
>>> Query: 327 STNLAKAAFESKWYEGSLRYKKEILILMAQAQRPLEISARGVIIISLDTFKI 378
>>> STNLAKAAFESKWYEGSLRYKKEILILMAQAQRPLEISARGVIIISLDTFKI
>>> Sbjct: 1090 STNLAKAAFESKWYEGSLRYKKEILILMAQAQRPLEISARGVIIISLDTFKI
>>> 1245
>>>
>>> Score = 18 (79.1 bits), Expect = 0.0, Sum P(4) = 0.0, Group = 1
>>> Identities = 18/18 (100%), Positives = 18/18 (100%), Frame = +2
>>>
>>> Query: 379 LMTITYRFFAVIRQTVEK 396
>>> LMTITYRFFAVIRQTVEK
>>> Sbjct: 1304 LMTITYRFFAVIRQTVEK 1357
>>>
>>>
>>> On Sun, 26 Jan 2003, Benjamin Berman wrote:
>>>
>>>> Dear Jason, Steve, others,
>>>>
>>>> I am trying to parse a wublast-2 output file using either the
>>>> 'blast' or
>>>> 'psiblast' SearchIO modules, and getting results I don't
>>>> understand. I am
>>>> attaching both the blast file (test.bla) and the short test script
>>>> (for_jason.pl) to this mail. Here are the results when I run the
>>>> test
>>> script:
>>>>
>>>> benb at sos: ./for_jason.pl ./test.bla blast
>>>> Hit to CG13158 ==> E=0.00e+00 query=(96.00% or 379 AAs of
>>>> "CG13158-PA" at
>>>> 100.00% iden) subject=(28.00% or 379 AAs
>>>> of "CG13158" at 100.00% iden)
>>>> Query inds ==> 1-396
>>>> Hit inds ==>
>>> 1-259,278-345,364-415,435.333333333333-452.333333333333
>>>> benb at sos:
>>>> benb at sos: ./for_jason.pl ./test.bla psiblast
>>>> Hit to CG13158 ==> E=0.00e+00 query=(96.00% or 379 AAs of
>>>> "CG13158-PA" at
>>>> 100.00% iden) subject=(84.00% or 379 AAs
>>>> of "CG13158" at 100.00% iden)
>>>> Query inds ==> 1-396
>>>> Hit inds ==>
>>> 1-259,278-345,364-415,435.333333333333-452.333333333333
>>>>
>>>> When I look at the blast output, it is clear to me that 396 AAs are
>>>> aligned
>>>> at 100% identity. This is consistent with the output of seq_inds.
>>>> But the
>>>> output of length_aln and frac_aligned seem to be wrong (both
>>>> parsers put
>>>> this at 379, missing the last HSP of 18 AAs). Additionally, the
>>>> psiblast
>>>> one says that this is 84% of the subject entry, while it should
>>>> actually be
>>>> 28% as 'blast' reports.
>>>>
>>>> I am using an up-to-date version of the bioperl-live repository.
>>>>
>>>> Am I doing something wrong?
>>>>
>>>> thanks,
>>>> ben.
>>>>
>>>>
>>>>
>>>>
>>>> At 12:20 PM 12/2/2002 -0800, Steve Chervitz wrote:
>>>>> --- Ewan Birney <birney at ebi.ac.uk> wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>> Bio::Tools::Blast.pm
>>>>>>
>>>>>> I want to kill this - it has lots of confusing documentation; it
>>> is big;
>>>>>> I think Bio::SearchIO is the way forward and is stable enough to
>>> take it
>>>>>> on; I will bug fix Bio::SearchIO on spec for the 1.2 series
>>> (phew!). So...
>>>>>> can we remove it and put in a warning in a BEGIN block saying -
>>>>>> it has
>>>>>> died, please try out Bio::SearchIO.
>>>>>>
>>>>>>
>>>>>> SteveC and Jason - what are your votes here.
>>>>>
>>>>> R.I.P. ;-)
>>>>>
>>>>> All of it's functionality has been taken over by other modules for
>>> some time
>>>>> now, so I feel it's ready to go away for 1.2. The Bio/Tools/Blast
>>> directory
>>>>> goes away as well.
>>>>>
>>>>> The warning message in the BEGIN block can be a bit more detailed,
>>> something
>>>>> like a table such as:
>>>>>
>>>>> Functionality Module to Use
>>>>> -------------- --------------
>>>>> BLAST parsing Bio::SearchIO
>>>>> BLAST running Bio::Tools::Run::StandAloneBlast or
>>>>> Bio::Tools::Run::RemoteBlast
>>>>> BLAST HTML formating Bio::SearchIO::Writer::HTMLResultWriter
>>>>>
>>>>> TODO: Transfer the non-BLAST-specific docs from the module-level
>>>>> POD in
>>>>> Bio::Tools::Blast over to Bio::SearchIO, and update it for the
>>>>> SearchIO
>>>>> system.
>>>>> This would include things like how to use the SearchIO::Writer
>>> modules. This
>>>>> would be task for Jason or I.
>>>>>
>>>>> Lately I've been working on merging my psiblast.pm SearchIO module
>>>>> into
>>>>> Jason's
>>>>> blast.pm so that we can get down to just a single Blast parsing
>>>>> system.
>>>>>
>>>>> Steve
>>>>>
>>>>>
>>>>> =====
>>>>> Steve Chervitz
>>>>> sac at bioperl.org
>>>>>
>>>>> __________________________________________________
>>>>> Do you Yahoo!?
>>>>> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
>>>>> http://mailplus.yahoo.com
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at bioperl.org
>>>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> --
>>> Jason Stajich
>>> Duke University
>>> jason at cgt.mc.duke.edu
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>
More information about the Bioperl-l
mailing list