Bioperl: off topic Beowulf question

Lee W. Jones lj25@ncal.verio.com
Mon, 6 Mar 2000 10:00:59 -0800 (PST)


Hi folks,

I no longer work at the company where I implemented all of this, but this
is one of the projects I did while I was there (it started about 1.5
years ago).  I can let you know what I tried, what worked well and
what didn't... I never got to Beowulf, but only because it never really
seemed necessary.

The setup depends mostly on what your primary goal is -- if you are looking
for quick responses to ad-hoc queries, then the PVM/MPI-aware versions of
the programs work well.  If you have a large number of batch jobs or are
running a high-throughput pipeline, then it's a matter of how to split
things among the farm.  You can also do both (but design for one as the
primary focus).

First I'll say what ended up working the best and then point out other
approaches.  If you have the network bandwidth, NFS-share the databases
from a central location and pull them over the network.  If the databases
get too big, split them into parts and just collect the information later.
The important thing is not to split based on the kernel's maximum file
size, but on the memory available on each node of the cluster.  For any
of the algorithms where the database is stored in memory rather than
scanned off the file system every time, the biggest time suck is actually
the file read and store.  Optimize the database chunks such that the first
read off the networked database is the only read that has to be done.
Linux does a nice job of caching here -- it keeps the data in memory
until it needs to purge it.
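
Something like this minimal sketch gets the chunking idea across -- split
a FASTA database into pieces sized to a node's memory budget.  The file
names and the 1.5 GB figure are just placeholders, not what we actually ran:

    # Split a FASTA database into chunks sized to fit in a node's RAM.
    # The 1.5 GB target is an example figure, not a recommendation.
    use strict;
    use warnings;

    my $target_bytes = 1.5 * 1024 * 1024 * 1024;   # per-node memory budget (assumed)
    my ($chunk, $written, $out) = (0, 0, undef);

    open my $in, '<', 'nr.fasta' or die "can't read nr.fasta: $!";
    while (my $line = <$in>) {
        # start a new chunk only on a record boundary (FASTA header line)
        if (!defined $out or ($line =~ /^>/ and $written >= $target_bytes)) {
            close $out if defined $out;
            $chunk++;
            open $out, '>', sprintf("nr.%02d.fasta", $chunk)
                or die "can't write chunk $chunk: $!";
            $written = 0;
        }
        print {$out} $line;
        $written += length $line;
    }
    close $out if defined $out;
    close $in;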

The next step is queuing and data organization so that there are the
fewest network reads or database changes at each node.  Partition,
partition, partition.  The better a job you can do pre-partitioning the
data and the jobs, the faster the system.  Dedicate machines to single
tasks and don't let them switch until they have finished their allocated
piece and there is enough of a backlog somewhere else to justify
re-casting them...
For processing speed, you can get the best of both worlds using a cluster
where each node has multiple processors, so that you can run the MPI-aware
programs at each node.  There is definitely a point of diminishing returns
here... I'd recommend, for each algorithm on each type of data, doing some
benchmarking to plot where adding processors no longer nets you a linear
growth in speed.  Find a reasonable midpoint and build each node to that
spec.
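
The benchmarking itself doesn't need to be fancy -- something along these
lines, with the placeholder command and CPU flag swapped for whatever your
MPI/thread-aware binary actually takes:

    # Run the same representative job with 1..N CPUs, record wall-clock
    # time, and look for where the speedup curve flattens.
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $t1;
    for my $cpus (1 .. 8) {
        my $start = time();
        system("your_search_command --cpus $cpus query.fasta db.chunk > /dev/null") == 0
            or die "run with $cpus cpus failed";
        my $elapsed = time() - $start;
        $t1 = $elapsed if $cpus == 1;
        printf "%2d cpus  %8.1f s  speedup %.2fx\n", $cpus, $elapsed, $t1 / $elapsed;
    }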

Stick a database at the back end (batch update, not constant writes) for
information collection, and have a smart process in the center that takes
care of the scheduling and job logic.  Minimally supervised autonomy for
the cluster.  If you NFS-serve the batch queues and/or data, then
fail-over becomes pretty trivial.  The end users can monitor progress
and collect their results off a database / web-server combination.
Beware the Linux NFS server... I love Linux, but its nfsd does not have
the reliability, speed, or efficiency of, say, Solaris or a dedicated
NFS server box (NetApp, Sun, and HP all make them).
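
The batch-update part is nothing more than collecting parsed results on
disk and loading them in one transaction.  A sketch with DBI -- the DSN,
table, and column names are invented for illustration:

    # Load a file of parsed results into the back-end database in one
    # transaction, instead of writing a row per hit as jobs finish.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=results;host=dbhost',
                           'farm', 'secret', { RaiseError => 1, AutoCommit => 0 });

    my $sth = $dbh->prepare(
        'INSERT INTO hits (query_id, subject_id, score, evalue) VALUES (?, ?, ?, ?)');

    open my $batch, '<', 'parsed_results.tab' or die "can't read batch file: $!";
    while (<$batch>) {
        chomp;
        $sth->execute(split /\t/);
    }
    close $batch;

    $dbh->commit;       # one commit for the whole batch
    $dbh->disconnect;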

An additional thing you can do in the design phase is leave hooks into
the queuing system so that you can handle ad-hoc queries as well.  The
submissions just have to go through the central organizer, which then
farms them out to the most appropriate subset of nodes.  Give ad-hoc
queries a higher priority in the queue so that the end user doesn't have
to wait for the batch to finish before they get their answers.  It won't
be instantaneous, since with this method users still submit and then go
back to the database for the results, but the wait isn't very long
(typically the time between batch updates to the database).  If there is
a need for real-time response, just dedicate a few machines to ad-hoc
queries only and circumvent the database.
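
In the organizer, the priority handling can be as simple as sorting the
pending jobs before dispatch.  The in-memory job structure here is made up:

    # Ad-hoc submissions carry a higher priority number and get
    # dispatched before batch work.
    use strict;
    use warnings;

    my @queue = (
        { id => 'batch-0001', priority => 1, query => 'est_set_17.fasta' },
        { id => 'adhoc-0042', priority => 9, query => 'user_query.fasta' },
        { id => 'batch-0002', priority => 1, query => 'est_set_18.fasta' },
    );

    # highest priority first
    my @ordered = sort { $b->{priority} <=> $a->{priority} } @queue;

    for my $job (@ordered) {
        printf "dispatching %s (priority %d)\n", $job->{id}, $job->{priority};
        # ... farm the job out to the most appropriate subset of nodes ...
    }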

After this system was humming, the data-analysis phase was never the
backlog...

Some other attempts proved nicer in theory but not so well in practice.
CORBA... distributed objects.  Great idea; high overhead and slower
performance.  For the network communication, I ended up just writing my
own communication protocol.  In theory this is bad because you've just
made your system very proprietary, but if you aren't selling it and you
just want performance, an optimized comm protocol works better.
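
To give a flavor of what I mean -- this is not the actual protocol, just
the simplest possible line-based version of the idea: nodes connect, ask
for work, and the organizer hands back a job line or NOJOB.

    # Bare-bones home-grown comm protocol over a TCP socket.
    use strict;
    use warnings;
    use IO::Socket::INET;

    my $server = IO::Socket::INET->new(
        LocalPort => 7777,
        Listen    => 10,
        Reuse     => 1,
    ) or die "can't listen on 7777: $!";

    my @pending = ("JOB query001.fasta nr.03.fasta\n",
                   "JOB query002.fasta nr.01.fasta\n");

    while (my $client = $server->accept) {
        my $request = <$client>;            # e.g. "GIMME node07\n"
        if ($request and $request =~ /^GIMME/) {
            my $reply = shift(@pending) || "NOJOB\n";
            print {$client} $reply;
        }
        close $client;
    }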

Database replication to each machine -- quicker read times for the
initial start-ups, but it's a nightmare to manage.  You can write a
central replication engine to take care of it, but it kills your network
when you are pushing many, many gigs to n machines.

Queues maintained at each node -- you can use in-memory queues with some
periodic disk-backup functions.  Faster, but fail-over is a nightmare if
a machine goes down (which they always will for some strange reason).
Actually it was usually network problems where the machine got bounced
for some reason, but that doesn't matter.
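
If you do go that route, a periodic snapshot of the in-memory queue at
least lets a bounced node pick up roughly where it left off.  A sketch
with Storable -- the file name and interval are arbitrary:

    # Per-node in-memory queue with a periodic disk snapshot.
    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my $snapshot = 'node_queue.snapshot';
    my @queue = -e $snapshot ? @{ retrieve($snapshot) } : ();

    my $last_dump = time();
    while (@queue) {
        my $job = shift @queue;
        # ... run $job ...

        if (time() - $last_dump > 60) {     # dump every minute or so
            store(\@queue, $snapshot);
            $last_dump = time();
        }
    }
    unlink $snapshot if -e $snapshot;       # clean exit, nothing left to recover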

Real-time monitor -- it's nice, but it's eye-candy that slows down the
system.  Get your progress by checking the file queues and/or the
database results.  Having every member of the farm send a message back
to the central server at the completion of each job gets to be a pain.
A multi-threaded socket server in Perl isn't that stable, and then add
the visualization side (Tk or Perl/Gtk)... I had to do it in C(++), and
it was way more work than what it provided.  PVM can do some of this for
you, but then you lose a lot on the performance side compared to writing
the system yourself.

Implicit in all of this is the ability to control each program, parse
and understand its output, and stick it all together in a schema...
Separate the process from the program so that the control wrappers
communicate with the process-control (farm) system in a generic way.  It
makes life easier later on...
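
What that looks like in practice is a small wrapper per program behind
one common interface, so the farm only ever calls run() and parse().  A
bare-bones sketch -- the package and method names are invented:

    # One wrapper per program; the farm sees only run() and parse().
    package FarmWrapper::GenericSearch;
    use strict;
    use warnings;

    sub new {
        my ($class, %args) = @_;
        return bless { command => $args{command} }, $class;
    }

    # run the wrapped program; the farm doesn't care what it is
    sub run {
        my ($self, $query, $db) = @_;
        my $out = "$query.out";
        system("$self->{command} $query $db > $out") == 0
            or die "wrapped command failed: $!";
        return $out;
    }

    # turn program-specific output into a generic tab-delimited schema
    sub parse {
        my ($self, $outfile) = @_;
        open my $fh, '<', $outfile or die "can't read $outfile: $!";
        my @rows;
        while (<$fh>) {
            chomp;
            push @rows, [ split /\t/ ];     # stand-in for real output parsing
        }
        close $fh;
        return \@rows;
    }

    1;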

gotta get to work.
hope this helps somebody.

I would have loved to contribute the framework for this to bioperl,
open-source etc., but that doesn't usually fly well with exec management.

lee

On Sun, 5 Mar 2000, Chris Dagdigian wrote:

> o parallel computing -- investigating performance of MPI-aware
> algorithms and software that can run concurrently against the entire
> cluster of linux boxes. Should be able to get amazing bang for the
> buck, performance-wise, if you have the in-house talent to handle
> the software side of things.
> 
> o distributed blast searching -- farming out searches to cheap linux/BSD boxes
> that have large memory and a single fast disk. I'm interested in clusters where
> the databases are stored locally on disk as well as fooling around with trying
> the same thing but having some type of fast read-only fiber-channel or NFS
> over gigabit ethernet subsystem providing access to a much larger set of searchable
> databases. (currently I have close to 380gigs of blastable databases that I need
> to maintain)
> 
> If nobody else has done this I am more than willing to set up some type of
> email list or web discussion forum that runs off of bioperl.org or spliceome.org. If this
> would be reinventing the wheel then I'd appreciate any URLs or additional info from
> people...
> 
> Regards,
> Chris
> 
> 

=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================