Bioperl: off topic Beowulf question

Sun, 5 Mar 2000 21:20:52 -0500 (EST)

These questions have been coming up more and more (most recently on
bionet.software or some such newsgroup).  Should I use PVM/MPI or should
I task-farm?  Do I NFS-share databases or maintain localized copies?  Is
Linux better than Solaris, better than X?  16 single-CPU machines vs. 8
dual-CPU machines?  Etc. etc.

I don't pretend to have the definitive answers for this; I can only tell
you what has worked (and not worked) for us.  But someone listening to
this might think about beginning to compile these things into a Beowulf
FAQ or HOWTO.

On Sun, 5 Mar 2000, Chris Dagdigian wrote:

> o parallel computing -- investigating performance of MPI-aware
> algorithms and software that can run concurrently against the entire
> cluster of linux boxes. Should be able to get amazing bang for the
> buck performance wise if you have the inhouse talent to handle
> the software side of things.

All the current PVM/MPI versions of blast, fasta and hmmer parallelize by
splitting the database amongst workers (with fasta you also have a little
control over job load, but it's static).  This works great for single
queries, but if you're running in "batch" mode (which I often am), then in
our experience it's actually faster to run multiple serial executions
(i.e. task farming).  It really depends on your need.
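
As a rough illustration of what "multiple serial executions" means, here
is a minimal Python sketch of a task farm on a single machine.  It is
not how any of the tools above work internally; the function names
(run_job, task_farm) and the stand-in jobs are assumptions made for the
example, and spreading jobs across cluster nodes would need a remote
launcher on top of this.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one serial job and return (command, exit status)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode


def task_farm(commands, workers=4):
    """Run independent serial commands, at most `workers` at a time.

    Each worker picks up the next command as soon as it finishes its
    current one, so no job ever waits on another job.
    Results come back in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, commands))


if __name__ == "__main__":
    # Stand-in jobs (hypothetical): each one only prints a query name.
    # In practice each entry would be a full search command line.
    queries = ["query1.fa", "query2.fa", "query3.fa"]
    jobs = [[sys.executable, "-c", f"print({q!r})"] for q in queries]
    for command, status in task_farm(jobs, workers=2):
        print(status, command[-1])
```

The point of the design is that every query is a complete, independent
run, so the only coordination needed is handing out the next job.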

> o distributed blast searching -- farming out searches to cheap linux/BSD boxes
> that have large memory and a single fast disk. I'm interested in clusters where
> the databases are stored locally on disk as well as fooling around with trying
> the same thing but having some type of fast read-only fiber-channel or NFS
> over gigabit ethernet subsystem providing access to a much larger set of searchable
> databases. (currently I have close to 380gigs of blastable databases that I need
> to maintain)

I've heard others argue that NFS-shared databases are a big loser, but we
have had nothing but success (especially when compared to the significant
maintenance task of keeping localized databases updated).  Our ethernet is
a fully switched, high-bandwidth network (sorry, I don't know the specs
off the top of my head).  OS page-caching and blast/fasta memory-mapping
also make NFS less of an issue (especially in batch mode as above; the
first run may take a while longer than normal, but subsequent runs are
blazingly fast and hardly ever touch NFS).  In my tests on this network,
task-farming is far speedier than the PVM/MPI implementations, because
efficient PVM/MPI batch execution requires the workers to stay tightly
synchronized.
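
One way to picture the synchronization cost is a toy model, sketched
below in Python.  Everything in it is an assumption made for
illustration: the cost numbers are invented, the "piece factors" stand
for uneven database slices, and I/O, caching and message passing are
ignored.  It only shows why waiting for the slowest piece of every
query can add up over a batch.

```python
import heapq


def split_database_time(query_costs, piece_factors):
    """Batch time when every query is split across all workers.

    piece_factors[i] is the relative size of worker i's database piece
    (1.0 = perfectly balanced).  A query finishes only when its slowest
    piece does, and the queries run one after another.
    """
    workers = len(piece_factors)
    return sum(
        max(cost / workers * factor for factor in piece_factors)
        for cost in query_costs
    )


def task_farm_time(query_costs, workers):
    """Batch time when whole queries go to whichever worker is free."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for cost in query_costs:
        # The least-loaded worker takes the next whole query.
        heapq.heappush(loads, heapq.heappop(loads) + cost)
    return max(loads)


if __name__ == "__main__":
    costs = [10.0] * 8                 # invented: 8 queries, 10 units each
    factors = [1.0, 1.0, 1.0, 1.5]     # invented: one oversized piece
    print(split_database_time(costs, factors))  # 30.0
    print(task_farm_time(costs, 4))             # 20.0
```

With these invented numbers the split-database batch pays for the
oversized piece on every single query, while the task farm simply keeps
all four workers busy with whole queries.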

Raphael Clifford and I have written a Perl program called "disperse" that
simplifies running batch jobs in a 'task-farming' protocol.  It will be
published soon in Bioinformatics.  I'm currently putting the "finishing
touches" on version 2.0, which includes blast, fasta, hmmer, clustalw, and
paml support, as well as a few other packages (adding packages is
trivially easy: you just need to tell disperse a few things about what a
typical command line looks like).
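
To make the "typical command line" idea concrete, here is a small
Python sketch of a per-package command template.  It does not show
disperse's actual configuration format, which is not described here;
the TEMPLATES table and build_command function are hypothetical, and
the blastall and hmmsearch argument layouts are given as commonly used
forms that should be checked against the installed versions.

```python
import shlex
from string import Template

# Hypothetical package table: one command-line template per package.
TEMPLATES = {
    "blast": Template("blastall -p $program -d $database -i $query -o $output"),
    "hmmer": Template("hmmsearch $model $database"),
}


def build_command(package, **params):
    """Fill in a package's template and return an argv-style list.

    Raises KeyError if the package is unknown or a placeholder in the
    template has no matching parameter.
    """
    template = TEMPLATES[package]
    # Quote each value so paths containing spaces survive shlex.split.
    quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
    return shlex.split(template.substitute(quoted))


if __name__ == "__main__":
    print(build_command("blast", program="blastp", database="nr",
                        query="q1.fa", output="q1.out"))
```

Under this scheme, adding a package means adding one template line,
and a batch is just the same template filled in once per query.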

-Aaron

-- 
 o ~   ~   ~   ~   ~   ~  o
/ Aaron J Mackey           \
\  Dr. Pearson Laboratory  / 
 \ University of Virginia  \     
 /  (804) 924-2821          \
 \  amackey@virginia.edu    /
  o ~   ~   ~   ~   ~   ~  o
