Bioperl: off topic Beowulf question
Aaron J Mackey
ajm6q@virginia.edu
Sun, 5 Mar 2000 21:20:52 -0500 (EST)
These questions have been coming up more and more (most recently on
bionet.software or some such newsgroup). Should I use PVM/MPI or should I
task-farm? Do I NFS-mount databases or maintain localized copies? Is Linux
better than Solaris, better than X? 16 single-CPU machines vs. 8 dual-CPU
machines? Etc., etc.
I don't pretend to have the definitive answers for this; I can only tell
you what has worked (and not worked) for us. But someone listening to
this might think about beginning to compile these things into a Beowulf
FAQ or HOWTO.
On Sun, 5 Mar 2000, Chris Dagdigian wrote:
> o parallel computing -- investigating performance of MPI-aware
> algorithms and software that can run concurrently against the entire
> cluster of linux boxes. Should be able to get amazing bang for the
> buck performance wise if you have the inhouse talent to handle
> the software side of things.
All the current PVM/MPI versions of blast, fasta and hmmer parallelize by
splitting the database amongst workers (with fasta you also have a little
control over job load, but it's static). This works great for single
queries, but if you're running in "batch" mode (which I often am), then in
our experience it's actually faster to run multiple serial executions
(i.e. task farming). It really depends on your need.
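To make the distinction concrete, here is a minimal sketch of task farming in Python. It is not how any particular package does it: the names `run_serial_job` and `task_farm` are made up for illustration, and the jobs are placeholder commands standing in for ordinary serial searches, one per query.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_serial_job(command):
    """Run one ordinary serial job as a child process; return its exit code."""
    return subprocess.run(command, stdout=subprocess.DEVNULL).returncode


def task_farm(commands, workers):
    """Hand out independent serial jobs to a fixed pool of workers.

    A worker picks up the next command as soon as it finishes its current
    one, so no job has to wait for any other job to complete.
    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_serial_job, commands))


if __name__ == "__main__":
    # Placeholder jobs (assumption): each stands in for one serial search
    # of one query against the whole database.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(8)]
    print(task_farm(jobs, workers=4))
```

In a database-split search, by contrast, every worker has to finish its slice before a query is done, which is where the waiting comes from in batch mode.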
> o distributed blast searching -- farming out searches to cheap linux/BSD boxes
> that have large memory and a single fast disk. I'm interested in clusters where
> the databases are stored locally on disk as well as fooling around with trying
> the same thing but having some type of fast read-only fiber-channel or NFS
> over gigabit ethernet subsystem providing access to a much larger set of searchable
> databases. (currently I have close to 380gigs of blastable databases that I need
> to maintain)
I've heard others argue that NFS-shared databases are a big loser, but we
have had nothing but success (especially when compared to the significant
maintenance task of keeping localized databases updated). Our ethernet is
a fully switched, high bandwidth network (sorry, I don't know the specs
off the top of my head). OS page-caching and blast/fasta memory-mapping
also make the NFS issue less of one (especially in batch mode as above;
the first run may take a while longer than normal, but subsequent runs are
blazingly fast and hardly ever touch the NFS server). In my tests on this network,
task-farming is far speedier than PVM/MPI implementations due to the
severe synchronicity requirement for optimally efficient PVM/MPI batch
execution.
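The "slow first run, fast later runs" effect can be sketched as follows. This is only an illustration of the idea, assuming the node has enough free memory to keep the database cached; `warm_page_cache` is a made-up name, and whether the pages actually stay cached is up to the OS.

```python
import os
import tempfile


def warm_page_cache(path, chunk_size=1 << 20):
    """Read a file from start to finish, discarding the data.

    On a node with enough free memory, the OS will usually keep the pages
    it just read in its page cache, so a later run reading the same file
    (even over NFS) can be served from local memory.
    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Stand-in for a database file (assumption): a small temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 3000)
    try:
        print(warm_page_cache(tmp.name, chunk_size=1024))
    finally:
        os.remove(tmp.name)
```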
Raphael Clifford and I have written a Perl program called "disperse" that
simplifies running batch jobs in a 'task-farming' protocol. It will be
published soon in Bioinformatics. I'm currently putting the "finishing
touches" on version 2.0 which includes blast, fasta, hmmer, clustalw, and
paml support, as well as a few other packages (adding packages is
trivially easy, you just need to let disperse know a few things about what
a typical command line usually looks like).
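As a rough idea of what "knowing what a typical command line looks like" could mean, here is a sketch in Python. To be clear, this is NOT disperse's actual configuration format or code: the `TEMPLATES` table and `build_command` are hypothetical, and the only real command line assumed is the usual `blastall -p/-d/-i/-o` form.

```python
import shlex
from string import Template

# Hypothetical per-package templates (not disperse's real format).
# $name placeholders are filled in for each job.
TEMPLATES = {
    "blast": "blastall -p $program -d $database -i $query -o $output",
}


def build_command(package, **fields):
    """Turn a package template plus per-job fields into an argument list.

    The template is split into tokens first and each token is substituted
    separately, so a value containing spaces stays a single argument.
    A missing field raises KeyError.
    """
    tokens = shlex.split(TEMPLATES[package])
    return [Template(token).substitute(fields) for token in tokens]


if __name__ == "__main__":
    print(build_command("blast", program="blastp", database="nr",
                        query="query1.fa", output="query1.out"))
```

Under this scheme, supporting another package would mean adding one more entry to the table.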
-Aaron
--
o ~ ~ ~ ~ ~ ~ o
/ Aaron J Mackey \
\ Dr. Pearson Laboratory /
\ University of Virginia \
/ (804) 924-2821 \
\ amackey@virginia.edu /
o ~ ~ ~ ~ ~ ~ o