[Bioperl-l] timing out a blast in StandAloneBlast.pm

BHurwitz@twt.com BHurwitz@twt.com
Tue, 2 Jul 2002 08:49:55 -0500


Hi Peter,

Thank you so much for your response.  I completely agree with you.  Since
BLAST does not retrieve HSPs in any particular order, timing out a BLAST
does not make scientific sense.  Our problem is that our Linux boxes seem
to crash on sequences that have a large number of HSPs (yes, we do need
better hardware...).  My thought was to time these out and capture them in
a separate file for more "specialized" processing later.  But I think
perhaps a better solution is to play with the BLAST parameters to allow
fewer HSPs through and use this set of parameters for the whole set, as you
suggested.  Unfortunately, BLAST doesn't have options like Megablast does
for limiting searches on "-p percent identity" and "-s score", so I am
working on adding Megablast to the existing StandAloneBlast.pm program,
which will hopefully help me to limit the HSPs a little better.  After
looking at the code for StandAloneBlast.pm, I see it is just a wrapper for
BLAST, so there is no magic going on there that isn't in the regular BLAST
program.  Thank you for all of your help!
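
The "time out and set aside" idea can be sketched as follows. This is only
an illustration in Python, not code from StandAloneBlast.pm; the function
name, the log file name and the query ID are made up for the example. It
also shows the limitation discussed in this thread: a run that is killed
gives back no partial hits, only a record of which query was too slow.

```python
import subprocess


def run_with_timeout(cmd, query_id, timeout_s, slow_log="slow_queries.txt"):
    """Run cmd and return its stdout, or None if it exceeds timeout_s.

    A timed-out run yields no partial results: the child process is
    killed and only the (hypothetical) query ID is appended to slow_log
    so the query can be processed separately later.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s, check=True)
        return done.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        with open(slow_log, "a") as fh:
            fh.write(query_id + "\n")
        return None
```

Here cmd would be whatever command line the local BLAST install expects,
for example a list such as ["blastall", "-p", "blastn", "-d", "genome",
"-i", "query.fa"], where the database and file names are placeholders.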

Kind Regards,
Bonnie



                                                                                                                                                  
Peter Kos <kos@rite.or.jp>
Sent by: bioperl-l-admin@bioperl.org
07/02/2002 04:17 AM
Please respond to "kos@rite.or.jp"

To:      "'BHurwitz@twt.com'" <BHurwitz@twt.com>, "bioperl-l@bioperl.org" <bioperl-l@bioperl.org>
cc:
Subject: RE: [Bioperl-l] timing out a blast in StandAloneBlast.pm




Hi,

I attempted to respond to your message right away the other day,
but my reply disappeared, so I am writing and sending it again.
It also highlights that your problem needs only patience and
endurance to be solved.

Correct me if I am wrong, but I think StandAloneBlast is just a nice
wrapper around the blastall call and the parser you choose. Therefore
the behavior of the blast search is no different from what you get
using blastall alone.
If you kill the process before it finishes, you discard all the hits it
has found so far. Blast first finishes the search, then sorts the hits
(HSPs) and then prints (writes) the list to the output file (a tempfile
if you use StandAloneBlast). You cannot set the blast search to find
the best hit first, print it, then search and find the second best
and so on.

From a scientific point of view I would fundamentally doubt the
significance of your results if you established them using a
criterion like "... HSPs found in the first five minutes ...", but -
perhaps luckily - you cannot do that.

You cannot tell the program: "OK, I have been waiting long enough, give
me the results I want, and do not waste my time searching for those
HSPs that I am not interested in". Life is not like that.
You can only kill the process in a sulky way, like "If you are still
not ready, I am not interested at all any more."

If you really MUST reduce the run time, you need to circumvent this
issue. I cannot imagine the scientific question which would let you
be satisfied with the "firstly found" HSPs, so I cannot suggest
how to alter the question.
However, if you mean that you would be satisfied with the _best_
matches, you can change the parameters (e-value and score) to carry
out a stricter search that yields fewer HSPs.
Repeatmasking is a valid idea, and you can still locate the masked
regions in the genome.
Similarly, for example, if you have many single copies of the given
sequence and a few multicopy tandem repeats, then you may run two
consecutive searches. In the first search, use the old query sequence
repeated two (three, ...) times in tandem as the new query, and double
(triple, ...) the score and/or use the square (**3, ...) of the
e-value. Then mask/disregard the sequence regions found, and run a
second search using the old query. (This is simple to program and may
or may not speed up the whole thing significantly. But that is just an
example.)
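
The arithmetic for the first of the two searches could look roughly like
this. It is only a sketch of the rule of thumb described above, written in
Python for illustration; the function name and the example thresholds are
invented, and how the resulting values are passed to BLAST is left open.

```python
def tandem_search_params(query, score, evalue, copies=2):
    """Return (new_query, new_score, new_evalue) for the first-pass search.

    The query is repeated `copies` times in tandem, the score threshold
    is multiplied by `copies`, and the e-value is raised to the power
    `copies` (squared for two copies, cubed for three, and so on).
    """
    if copies < 2:
        raise ValueError("copies must be at least 2")
    return query * copies, score * copies, evalue ** copies


# Hypothetical starting values: a toy query, score 50, e-value 1e-5.
new_query, new_score, new_evalue = tandem_search_params("ACGT", 50, 1e-5)
print(new_query, new_score, new_evalue)
```

The regions hit by this first search would then be masked or ignored
before running the second search with the original query and thresholds.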

Anyway, I would not encourage you to invest effort in timing out the
blast search.

Good luck.
Peter

> I posted this once before but I didn't get any responses, so I thought
> I'd post a little more detail.  If this is the wrong place for asking
> this question just let me know.  Thanks!
>
> I am using StandAloneBlast.pm module in my program to run blast and so
> far it is great!  I have one problem though, some blasts take hours to
> run since they are matching against repetitive element and generating
> tons of HSPs.  Since Blast does not have any way of setting a max number
> of HSPs, I was thinking about altering the StandAloneBlast.pm module to
> set a time limit on the blast and just retrieve the results that it got
> within the specified period of time.  This would probably require some
> sort of fork and exec, rather than a system call and use of the alarm
> command for timing.
> I was wondering if anyone has any advice or if someone else has already
> generated similar code before?
>
> Probably the easiest solution would be to repeatmask the sequences
> before putting them into Blast to limit these problems.  But there are
> two things we are worried about, one is over-masking the sequences and
> two even if we have a sequence that is repetitive element we still need
> to find a place for them on the genome.  However, I wonder if I just
> time out the Blast if I will get the best HSPs or if they occur randomly
> in the Blast search.
>
> Thank you so much for any advice you can provide,
>
> Bonnie
>
