[Bioperl-l] timing out a blast in StandAloneBlast.pm

Tue, 2 Jul 2002 18:17:44 +0900

Hi,

I tried to respond to your message right away the other day, but my 
reply disappeared, so I am writing and sending it again.
That also illustrates that your problem needs only patience and 
endurance to be solved.

Correct me if I am wrong, but I think StandAloneBlast is just a nice 
wrapper around the blastall call and the parser you choose. Therefore 
the behavior of the blast search is no different from what you get 
using blastall alone.
If you kill the process before it finishes, you discard all the hits 
it has found so far. Blast first finishes the search, then sorts the 
hits (HSPs) and then prints (writes) the list to the output file (a 
tempfile if you use StandAloneBlast). You cannot set the blast search 
to find the best hit first, print it, then search and find the second 
best and so on.
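
To make the consequence concrete, here is a minimal sketch of a timed 
run, written in Python rather than Perl and not taken from 
StandAloneBlast. The helper name run_with_timeout is my own invention. 
It only reports whether the child process finished; if it did not, the 
report was never written, so there is nothing partial to parse.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and wait at most timeout_s seconds.

    Returns True if the command finished in time, False if it was killed.
    """
    try:
        subprocess.run(cmd, check=False, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child before re-raising.
        return False
```

With blastall this might be called as 
run_with_timeout(["blastall", "-p", "blastn", "-d", "genome", "-i", 
"query.fa", "-o", "out.blast"], 300), where the database and file names 
are placeholders. A False return means the search should be treated as 
lost, not as a shorter list of hits.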

From a scientific point of view I would fundamentally doubt the 
significance of your results if you established them using a 
criterion like "... HSPs found in the first five minutes ...", but - 
perhaps luckily - you cannot do that.

You cannot tell the program: "OK, I have been waiting long enough, 
give me the results I want, and do not waste my time searching for 
those HSPs that I am not interested in". Life is not like that.
You can only kill the process in a sulky way, like "If you are still 
not ready, I am not interested at all any more."

If you really MUST reduce the run time, you need to work around this 
issue. I cannot imagine a scientific question that would let you 
be satisfied with the "first found" HSPs, so I cannot suggest 
how to alter the question.
However, if you mean that you would be satisfied with the _best_ 
matches, you can change the parameters (e-value and score) to carry 
out a stricter search that returns fewer HSPs.
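
As a rough illustration of the stricter search, the sketch below builds 
a blastall argument list with a tighter e-value cutoff. It is in Python 
and the function name and the example values are my own. The flags -p, 
-d, -i, -o, -e and -b are the ones I know from the legacy blastall 
program; please check them against the usage text of your installed 
version.

```python
def blastall_command(program, database, query, output,
                     evalue=1e-20, max_alignments=50):
    """Build a blastall argument list with a strict e-value cutoff.

    The default evalue and max_alignments are illustrative only.
    """
    return [
        "blastall",
        "-p", program,               # e.g. blastn
        "-d", database,              # formatted database name
        "-i", query,                 # query FASTA file
        "-o", output,                # report file
        "-e", str(evalue),           # expectation value cutoff
        "-b", str(max_alignments),   # database sequences to show alignments for
    ]
```

The resulting list can be handed to whatever you use to start the 
process. A lower -e drops weak matches from the report, which is the 
point of asking only for the best ones.
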
Repeatmasking is a valid idea, and you can still find where the 
masked regions lie in the genome.
Similarly, for example, if you have many single copies of the given 
sequence and a few multicopy tandem repeats, then you may run two 
consecutive searches. In the first search, use the old query sequence 
repeated two (three, ...) times as the new query, and double 
(triple, ...) the score threshold and/or use the square (**3, ...) of 
the e-value. Then mask or disregard the sequence regions found, and 
run a second search using the old query. (This is simple to program 
and may or may not speed up the whole thing significantly. But that 
is just an example.)
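
A minimal sketch of that two-pass idea, in Python, with function names 
of my own choosing. The first helper derives the query and thresholds 
for the tandem-repeat search; the second masks the regions found before 
the second search. I assume regions are given as 0-based, half-open 
(start, end) pairs, which is not how blast reports coordinates, so a 
conversion would be needed.

```python
def tandem_search_params(query, copies, score, evalue):
    """Return (new_query, new_score, new_evalue) for the first search.

    The query is repeated `copies` times, the score threshold is
    multiplied by `copies`, and the e-value is raised to that power.
    """
    return query * copies, score * copies, evalue ** copies


def mask_regions(sequence, regions, mask_char="N"):
    """Replace each (start, end) region of sequence with mask_char.

    Regions are assumed 0-based and half-open; out-of-range ends are clamped.
    """
    chars = list(sequence)
    for start, end in regions:
        start = max(0, start)
        end = min(len(chars), end)
        if start < end:
            chars[start:end] = mask_char * (end - start)
    return "".join(chars)
```

For example, tandem_search_params("ACGT", 2, 50, 1e-5) gives the 
doubled query "ACGTACGT", a score threshold of 100 and an e-value of 
about 1e-10. Whether this saves any time depends on your data.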

Anyway, I would not encourage you to invest effort in timing out the 
blast search.

Good luck.
Peter

> I posted this once before but I didn't get any responses, so I thought I'd
> post a little more detail.  If this is the wrong place for asking this
> question just let me know.  Thanks!
>
> I am using StandAloneBlast.pm module in my program to run blast and so
> far it is great!  I have one problem though, some blasts take hours to run
> since they are matching against repetitive element and generating tons of
> HSPs.  Since Blast does not have any way of setting a max number of HSPs,
> I was thinking about altering the StandAloneBlast.pm module to set a time
> limit on the blast and just retrieve the results that it got within the
> specified period of time.  This would probably require some sort of fork
> and exec, rather than a system call and use of the alarm command for
> timing.
> I was wondering if anyone has any advice or if someone else has already
> generated similar code before?
>
> Probably the easiest solution would be to repeatmask the sequences before
> putting them into Blast to limit these problems.  But there are two things
> we are worried about, one is over-masking the sequences and two even if we
> have a sequence that is repetitive element we still need to find a place
> for them on the genome.  However, I wonder if I just time out the Blast if
> I will get the best HSPs or if they occur randomly in the Blast search.
>
> Thank you so much for any advice you can provide,
>
> Bonnie
>