[Bioperl-l] timing out a blast in StandAloneBlast.pm
Peter Kos
kos@rite.or.jp
Tue, 2 Jul 2002 18:17:44 +0900
Hi,
I tried to respond to your message right away the other day, but my
reply disappeared, so I am writing and sending it again.
It also highlights that your problem needs only patience and
endurance to be solved.
Correct me if I am wrong, but I think StandAloneBlast is just a nice
wrapper around the blastall call and the parser you choose. Therefore
the behavior of the blast search is no different from what you get
using blastall alone.
If you kill the process before it finishes, you discard all the hits
it has found so far. Blast first finishes the search, then sorts the
hits (HSPs), and only then writes the list to the output file (a
tempfile if you use StandAloneBlast). You cannot set the blast search
to find the best hit first, print it, then search for the second best,
and so on.
From a scientific point of view I would fundamentally doubt the
significance of your results if you established them using a
criterion like "... HSPs found in the first five minutes ...", but -
perhaps luckily - you cannot do that.
You cannot tell the program: "OK, I have waited long enough, give
me the results I want, and do not waste my time searching for those
HSPs that I am not interested in". Life is not like that.
You can only kill the process in a sulky way, like "If you are still
not ready, I am not interested at all any more."
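A minimal sketch of that all-or-nothing timeout, written in Python
rather than Perl and assuming the search is launched as an external
command. The helper name run_with_timeout is hypothetical and is not
part of StandAloneBlast. On timeout the child is killed and nothing is
returned, because no usable partial report exists at that point.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its stdout, or None if it did not finish in time."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; nothing is salvaged.
        return None
    return done.stdout


# Stand-in commands (assumptions); a real call would pass the blastall
# command line instead of these small Python one-liners.
quick = [sys.executable, "-c", "print('finished')"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

quick_result = run_with_timeout(quick, 30)
slow_result = run_with_timeout(slow, 1)
```

Here quick_result holds the child's output, while slow_result is None:
the slow job is simply abandoned, which matches the behaviour described
above.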
If you really MUST reduce the run time, you need to circumvent this
issue. I cannot imagine a scientific question that would let you
be satisfied with the "first found" HSPs, so I cannot suggest how
to alter the question.
However, if you mean that you would be satisfied with the _best_
matches, you can change the parameters (e-value and score) to carry
out a stricter search that returns fewer HSPs.
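As a rough illustration of tightening the e-value, the sketch below (in
Python, not Perl) builds an argument list for the legacy blastall
program with an explicit -e cutoff. The function name and the file and
database names are made up for the example, and the cutoff 1e-20 is
only a placeholder; a score threshold is not covered here.

```python
def blastall_command(program, database, query_file, out_file, evalue):
    """Build a legacy blastall argument list with an explicit e-value cutoff."""
    return [
        "blastall",
        "-p", program,      # search program, e.g. blastn
        "-d", database,     # name of the formatted database
        "-i", query_file,   # query sequence file
        "-o", out_file,     # file the report is written to
        "-e", str(evalue),  # expectation-value threshold
    ]


# Hypothetical names and cutoff, for illustration only.
cmd = blastall_command("blastn", "genome_db", "query.fa", "hits.out", 1e-20)
print(" ".join(cmd))
```

A lower cutoff makes the search report fewer, better hits; whether it
also shortens the run time enough depends on the data.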
Repeatmasking is a valid idea, and you can still find the location of
the masked regions in the genome.
Similarly, for example, if you have many single copies of the given
sequence and a few multicopy tandem repeats, then you may run two
consecutive searches. In the first search, use the old query sequence
repeated two (three, ...) times as the new query, and double
(triple, ...) the score and/or use the square (cube, ...) of the
e-value. Then, in a second search using the old query, mask or
disregard the sequence regions found in the first. (This is simple to
program and may or may not speed up the whole thing significantly. But
that is just an example.)
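The first-pass setup can be sketched in a few lines of Python. The
function name and the sample sequence, score and e-value are invented
for illustration; only the scaling rule (repeat the query n times,
multiply the score by n, raise the e-value to the n-th power) comes
from the paragraph above.

```python
def repeated_query(query, copies, min_score, max_evalue):
    """Return the tandem query and thresholds scaled for `copies` repeats."""
    new_query = query * copies            # old query repeated back to back
    new_score = min_score * copies        # double, triple, ... the score
    new_evalue = max_evalue ** copies     # square, cube, ... the e-value
    return new_query, new_score, new_evalue


# Hypothetical starting values.
tandem, score, evalue = repeated_query("ACGT", 3, 40, 1e-5)
print(tandem, score, evalue)
```

The regions hit in this first pass would then be masked or ignored when
the original query is searched in the second pass.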
Anyway, I would not encourage you to invest effort in timing out the
blast search.
Good luck.
Peter
> I posted this once before but I didn't get any responses, so I
> thought I'd post a little more detail. If this is the wrong place
> for asking this question just let me know. Thanks!
>
> I am using StandAloneBlast.pm module in my program to run blast and
> so far it is great! I have one problem though, some blasts take
> hours to run since they are matching against repetitive element and
> generating tons of HSPs. Since Blast does not have any way of
> setting a max number of HSPs, I was thinking about altering the
> StandAloneBlast.pm module to set a time limit on the blast and just
> retrieve the results that it got within the specified period of
> time. This would probably require some sort of fork and exec,
> rather than a system call and use of the alarm command for timing.
> I was wondering if anyone has any advice or if someone else has
> already generated similar code before?
>
> Probably the easiest solution would be to repeatmask the sequences
> before putting them into Blast to limit these problems. But there
> are two things we are worried about, one is over-masking the
> sequences and two even if we have a sequence that is repetitive
> element we still need to find a place for them on the genome.
> However, I wonder if I just time out the Blast if I will get the
> best HSPs or if they occur randomly in the Blast search.
>
> Thank you so much for any advice you can provide,
>
> Bonnie
>