[Beowulf] best architecture / tradeoffs

Robert G. Brown rgb at phy.duke.edu
Fri Aug 26 20:59:21 PDT 2005

Seth Keith writes:

> Hello everyone, I am Seth Keith, this is my first mailing.
> I am new to the distributed computing thing, but despite this I find 
> myself constructing a Beowulf system. I have put together a few 
> different types and experimented and read enough to realize it is time 
> to solicit outside opinions. I really hope to get some good advice. If 
> you have the time please help me out...
> My requirements are easy, I think, since my program is already broken up 
> into a lot of different programs communicating via STDIN/OUT. I 
> benchmarked and found my problem is CPU intensive. The overall data 
> transfer is small, but all the different parts need to be assembled 
> before the final pass on the data. The final pass cannot be broken up, 
> but the final pass is fast. So my model is  input data -> break up into 
> N workers -> assemble results -> process -> done.
> I need advice on a few of the tradeoffs:
> 1) diskless vs disk
> I am thinking diskless is better. I don't worry about network traffic as 
> much as power consumption, overall node cost, reliability, and ease of 
> management. My nodes are all identical, so I figure diskless, right?  
> Well I am having a few problems...

Diskless vs disk isn't THAT simple.  Diskless nodes rely on the server
which gives you a single point of failure.  Diskfull nodes can be
configured to standalone.  In fact, a diskfull node cluster doesn't need
to have any particular "head node" so that any node can go down and you
can work with the remaining nodes.  Also, as you're finding out, there
is a bit more of a learning curve for diskless when things don't work
perfectly the first time.

> I still don't know exactly about swap. One of the clusters I set up was 
> an NFS mounted root file system that did something with swap to 
> /dev/loop0, but I don't really understand that, is the swap going onto 
> the nfs drive or is it just back into memory? What is the best ( fastest 
> ) way to handle swap on diskless nodes that might sometimes be 
> processing jobs using more than the physical RAM?

The best way to handle swap is to not swap, and certainly not to swap to
memory (any sort of ramdisk).  Under ANY circumstances, swapping is
almost certain to slow you down more than clustering speeds it up.
Disks are just a lot slower than memory under the very best of
conditions.  Networked remote disk even more so, if you manage to work
this out.  So equip the nodes with enough memory for the operating
system and all jobs to be resident at once, times maybe two if you can
afford it, especially if you want to run diskless.  Linux will cache and
buffer everything so that you will actually go over the network back to
the NFS server only rarely, when starting something up or loading a
cache.  Use the money you save on disk to buy more memory.

Or, if you MUST swap, I'd strongly advise not going diskless, or
thinking about some of the more exotic ways of doing business (using
memory on remote systems as a network based "swap" for example).
Usually swapping means that you are trying to run a really big job, and
if so there are USUALLY other ways to approach it -- partitioning the
problem and parallelizing, buying more memory, whatever.  The only time
swapping makes sense is when you're running a big job that is SO
valuable that finishing it -- however slowly and painfully -- is worth
it even if it takes 100 or 1000 times as long as a memory based job
might have taken.

> Also, is it really true you need a separate copy of the root nfs drive 
> for every node? I don't see why this is. I have it working with just one 
> for all nodes, but am I missing something here?

There can be problems with certain parts of a normal system's operation
if they all share a single root.  For example, how do they manage
writing to the same file(s) in /var?  What about files in /etc that give
a system some measure of identity (e.g. ssh keys).  What about
applications that create e.g. file-based locks in /tmp that can just
plain collide if you try to run them on several nodes all trying to
create their lockfile in the same place?  To be robust and reliable, you
need to implement some sort of sharing-aware file locking, but unix
processes and applications often aren't written to acquire locks or
respect locks made under the assumption of a multiple-system shared FS
when doing various tasks.  You even run the risk of really breaking
something if two systems open the same critical file at almost the same
time, make a change, and then write back to it, then reread, then
rewrite, and so on.  Each system ASSUMES that what it just wrote is
there, but what is there for at least one of them is what the OTHER
system just wrote.  So rolling your own single-exported-root cluster can
work, or can appear to work, or can work for a while and then
spectacularly fail, depending on just what you run on the nodes and how
they are configured.

There are, however, ways around most of the problems, and there are at
this point "canned" diskless cluster installs out there where you just
install a few packages, run some utilities, and poof it builds you a
chroot vnfs that exports to a cluster while managing identity issues for
you transparently.  You might check out warewulf, for example, which
"prefers" to create a virtual cluster that boots and runs diskless in
this manner (although it can also create a diskful cluster according to
the docs).  This sort of kit usually gives you a mailing list and
collegial support when you're getting started, too.  Warewulf is
particularly nice because it tries to be distribution agnostic -- you
don't "have to use RHEL" or "have to use SuSE" or even "have to use
CaOSity" (upon which it is actually based) -- use whatever you like that
is more or less standard at the service layer (nearly any linux
distribution will do).

> 2) message passing vs roll yer own
> I have played with a few different packages, and written a bunch of perl 
> networking code, and read a bunch and I am still not sure what is 
> better. Please chime in:
>     - what is the fastest way to run perl on worker nodes. Remember I 
> don't need to do anything too fancy, just grab a bunch of workers, send 
> jobs to them, assemble the results, send the results to another worker, 
> etc. I don't need to broadcast to all nodes or anything else.

Perl these days supports threads.  I wrote (for Cluster World Magazine,
a year plus ago) a demo parallel program that did master/slave parallel
programming by:

  a) forking off threads per node in the master task;
  b) cranking up ssh's per thread to distribute a task and data;
  c) collecting the data written back per thread;
  d) terminating the threads, assembling the data;
  e) writing out the results and quitting.

As long as your master can keep track of who's on first, etc. (perhaps
an associated hash table) you're in business.
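The a)-e) steps above can be sketched roughly as follows -- in Python rather than perl just to keep it compact (this is NOT the original Cluster World code, and the transport is made pluggable so the sketch is self-contained; in real use it would build an `ssh node cmd` command line):

```python
# Illustrative sketch of the thread-per-node master/slave pattern:
# a) one worker thread per node; b) each thread cranks up a remote
# shell; c) output collected per thread; d) threads joined; e) results
# assembled into a dict keyed by node (the "who's on first" table).
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_node(node, cmd, transport=None):
    """Run cmd on node, return (node, stdout).  transport lets a
    caller substitute something local for ssh when testing."""
    argv = (transport or ["ssh", node]) + [cmd]
    out = subprocess.run(argv, capture_output=True, text=True)
    return node, out.stdout.strip()

def master(nodes, cmd, transport=None):
    """Distribute cmd to every node in parallel, gather the results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(run_on_node, n, cmd, transport)
                   for n in nodes]
        return dict(f.result() for f in futures)
```

Substituting `["sh", "-c"]` for the transport runs the same structure locally, which is handy for debugging the master logic before any nodes are involved.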

Note that I have no idea if this is the fastest or best way, just that
it is one way.  In perl, after all, TMTOWTDI no matter what it is, says
it right there in the perl book.

>     - what is the easiest way to do it. I wrote the whole thing in perl 
> already, and I was not really impressed with the speed or reliability. 
> Certainly this was at least partially programmer error, but my question 
> stands, what is the easiest way to reliably control a cluster of worker 
> nodes running different perl programs, and assembling the data. This 
> includes load balancing.

Aw, now you're talking all crazy about contradictory things.  You want
speed, reliability, control, and load balancing, all with a
do-it-yourself script?  To quote Austin Powers, "Well, I want a
gold-plated potty but it doesn't mean I'm going to get one...";-)

Seriously, if you want these features (especially in perl) you have to
program them all in, and none of them are terribly easy to code.  The
use of threads that I indicate above COULD be extended to do the control
and load balancing part -- that isn't that hard.  For non-IPC-intensive
jobs, speed of the remote shell part is irrelevant -- if you're getting
poor performance it is probably because you're blocking somewhere and
maybe using threads will unblock you.

Reliability of ANY sort of networking task (run through perl or not) --
well, there are whole books devoted to this sort of thing and it is Not
Easy.  Not easy to tell if a node crashes, not easy to tell if a network
connection is down or just slow, not easy to restart if it IS down,
especially if the node drops the connection because of some quirk in its
internal state that -- being remote -- you can't see.  Not Easy.

If you want better control or master execution efficiency than you get
with perl and threads or the like, try using PVM or MPI in C.  You still
have to do load balancing yourself, and you still have to manage
restarting tasks on downed nodes and reliability, but the control part
and the speed part are then as good as they get, and adding balancing
and reliability is as easy as it gets.

One last alternative for Master Geeks Only is to write your own
task execution daemon for the nodes and your own master application to
create and maintain the connections to it.  With this you can avoid the
overhead of a shell process altogether (a significant part of what makes
remote shells slow) and the encryption/authentication that can make
ssh's slower still.  Although as has been pointed out on list, rsh is by
its nature a pretty minimal remote execution facility already, security
bleeding sore that it is.
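To make the daemon idea concrete, here is a bare-bones sketch: a node-side server that executes registered handlers directly over a plain TCP socket, avoiding both a shell fork and ssh's crypto overhead.  The protocol and the "ping" handler are made up for illustration; a real version needs framing, error handling, and some thought about security.

```python
# Bare-bones "roll your own task daemon" sketch.  One round trip:
# master sends a task name, node runs the matching handler in-process
# (no shell spawned), node sends the result back.
import socket
import threading

HANDLERS = {"ping": lambda: "pong"}   # hypothetical task table

def start_server(host="127.0.0.1", port=0):
    """Bind and listen; port 0 lets the OS pick a free port."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]

def serve_once(srv):
    """Node side: accept one connection, dispatch one task, reply."""
    conn, _ = srv.accept()
    task = conn.recv(1024).decode().strip()
    reply = HANDLERS.get(task, lambda: "unknown task")()
    conn.sendall(reply.encode())
    conn.close()

def submit(port, task, host="127.0.0.1"):
    """Master side: one task round trip over a raw TCP socket."""
    with socket.create_connection((host, port)) as c:
        c.sendall(task.encode())
        return c.recv(1024).decode()
```

Keeping the connection persistent (a loop in serve_once instead of one accept) is what buys you the big win over repeated rsh/ssh setup and teardown.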

>     - I saw some information on clusters that were linked in the kernel 
> and acted as a single machine. Is this a working reality? How does the 
> performance of such a system compare with message passing for dedicated 
> processing such as my own.

See Scyld. (www.scyld.com).  Bring your checkbook.  But yeah, being
written/maintained by some of the best and brightest, you can expect its
performance to be "better" than your efforts.  Yes.  I think that's safe
to say.  Better.

OTOH, if you're using perl as a job distribution mechanism, you have
lots of room to improve without resorting to scyld, and you can always
try bproc (which is a GPL thing that is at the heart of scyld, or used
to be) on your own for free.  Google will help you find it.  Search for
e.g. clustermatic.  Or look for links on brahma
(www.phy.duke.edu/brahma) -- pretty sure they are there.

>     - I was playing with MPICH-2, is this better than LAM? What about 
> other message passing libraries what is the best one?  any with direct 
> hooks into perl?

Now you're trying to start religious wars, aren't you?  Troublemaker;-)

Let me be democratic.  They're all better.  The one you are likely to
like the best is the one you use the most, and all of them will permit
you to write a decent parallel program.  LAM has the advantage of coming
prebuilt with e.g. FCX, so it is 'easy' to install and run.  MPICH-2 has
nifty new extensions that make it arguably more powerful (but also
more complex).  They manage somewhat differently, IIRC, at the UI level.
PVM (my own favorite) is semi-obsolete and not being heavily developed,
but it has some very nice features and "old guys" in the cluster
business seem to like it.  I honestly don't know if any of them hook
into perl, but of course perl can manage raw sockets (or even a
cluster's worth of raw sockets) just fine -- I once wrote a perl-Tk GUI
application that connected to homemade daemons on a whole cluster to
retrieve system statistics.  That's where I learned (the hard way) just
how difficult it is to manage that reliability thing -- if a node hangs
(dropping one connection in a list) what does your application do?  This
was pre-thread perl, so one distinct possibility was to hang the
application waiting for a message that never comes.  To avoid this, you
have to start to learn WAY more than you likely want to about sockets
and network connections.  And it is still something of a problem, IMO,
even in the real parallel message libraries.
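The basic defense against exactly that hang -- a blocking read on a connection to a node that has silently died -- is a socket timeout, so the read gives up instead of waiting forever.  A minimal sketch (Python here; perl's IO::Socket has the equivalent knobs):

```python
# The hang described above: a blocking recv() on a dead node's
# connection waits forever.  A timeout turns "hang forever" into
# "give up and let the caller retry or mark the node down".
import socket

def read_with_timeout(conn, nbytes=1024, seconds=2.0):
    """Return data from conn, or None if the peer stays silent."""
    conn.settimeout(seconds)
    try:
        return conn.recv(nbytes)
    except socket.timeout:
        return None   # node hung or link dropped: caller decides what now
```

Timeouts don't tell you WHY the node went quiet (crashed vs. slow vs. network partition), which is why reliability stays hard even with them -- but at least your master keeps running.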

>     - how fast is NFS and RSH. If I were to change the code so it works 
> with a NFS mounted file instead of STDIN/OUT and I use RSH to 
> communicate how would the speed compare with message passing? with 
> direct perl networking?

Again, I'm not sure what you mean by "direct perl networking", but I'll
take a stab at this.  NFS is not a good way to communicate between node
tasks in almost all circumstances.  In addition to being slow, you
either have to have the client tasks polling (inefficient) or an
out-of-band connection to tell the client "there is a message in file X,
wake up and read it".  Also inefficient.

Now, perl talking socket to socket to perl is as efficient as perl, and
sockets, can be.  In the case of the perl part, that isn't terribly
efficient compared to native C doing exactly the same thing, but it
isn't too shabby, either -- within a factor of ten, probably (which can
still be pretty fast).  The thing you want to avoid is the overhead of
constantly setting up and taking down the sockets (so you need a
persistent connection) and you still have to deal with EITHER the
overhead of perl forking a shell to run the task, collect the data, and
ship it back over the socket or perl doing the computation directly in
the perl slavey ditto.

Perl using rsh to spawn a process on a node and collect the results
should be quite efficient.  I used to benchmark rsh connections to
compare to ssh, and one can rsh a host and run a simple command and
collect results 10+ times per second.  ssh isn't even THAT bad, at maybe
3 tasks per second, and is much more secure, but if a lot of data is
moved its overhead goes up and relative efficiency goes down.  The perl
part of that transaction should be forking a thread (done once per
node?) and starting up the ssh in the connection.  From what I
>>recall<< watching a parallel PVM task start up vs starting up threaded
perl parallel tasks, I think that that factor of 3:1 is a fair estimate
of relative efficiency -- PVM and daemon-direct stuff is going to be
faster than rsh and/or ssh by as much as an order of magnitude but not
much more than that, I don't think.
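If you want to measure this yourself, the benchmark is trivial: time a batch of short command round trips and divide.  A crude sketch (a locally spawned shell stands in for rsh/ssh here, so the absolute numbers will differ; the measurement technique is the point -- swap in `["ssh", "somenode", "true"]` to measure the real thing):

```python
# Crude spawn-rate benchmark: how many short command round trips
# complete per second?  Locally spawned sh stands in for rsh/ssh.
import subprocess
import time

def spawn_rate(n=20, argv=("sh", "-c", "true")):
    """Run argv n times; return completed invocations per second."""
    t0 = time.perf_counter()
    for _ in range(n):
        subprocess.run(list(argv), capture_output=True)
    elapsed = time.perf_counter() - t0
    return n / elapsed
```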

Note that if your task is GRANULATED correctly, this is utterly
irrelevant.  If it takes you a whole second per node to set up tasks on
the nodes, it simply doesn't matter IF the tasks take a day to run
(86400 seconds).  If it takes the tasks a whole second to run (after
taking a second to set up) you might as well not bother parallelizing
them, as running them serially on a single node will be as fast or
faster.  So only for certain ranges of execution time relative to
startup time should you even care what the "most efficient" or "fastest"
master might be.  Worry instead about which one you can most easily code
and maintain, whether or not the result needs to be portable, ease of
debugging, ability to get support and help from your friends...:-)
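The granularity argument above is just arithmetic: parallel wall time is roughly per-node setup plus the divided work, so setup only matters when it is a noticeable fraction of the work.  A one-liner makes the two cases from the text concrete:

```python
# Parallel speedup with per-node startup cost: serial time divided by
# (setup + work/N).  With 1 s setup, a day-long job on 16 nodes gets
# essentially the full factor of 16; a 1 s job gets a slowdown.
def parallel_speedup(work_s, setup_s, n_nodes):
    """Serial runtime divided by (per-node setup + divided work)."""
    return work_s / (setup_s + work_s / n_nodes)
```

For work = 86400 s, setup = 1 s, 16 nodes the speedup is within a fraction of a percent of 16; for work = 1 s it drops below 1, i.e. the serial run wins.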

> 3) Distribution and kernel
> I create my NFS system by copying directories off my RH9 distribution. I 
> had lots of problems and could never get everything working. I think it 
> would be loads easier if I could find a standard distribution image 
> already constructed somewhere out there... I don't really care what 
> distribution as long as I can run perl.

See warewulf, as I said.  Here:

  a) First upgrade from RH 9.  You want to be running a modern kernel
and modern perl (with threads) right?  Reinstall using FC 4 or Centos 4
-- they'll be "just like" RH 9 except better.  In particular, they'll
have 2.6 kernels and yum, both of which you WANT on your master and your
nodes.

  b) Visit warewulf-cluster.org.  Get warewulf source rpms.  Also get
support program rpms (MAKEDEV and a perl module).  rpmbuild --rebuild
all three.  Install all three on the master.

  c) Follow instructions on site and in /usr/share/doc and in man pages
for creating a vnfs (virtual nfs root) in a chroot directory.  There are
sample scripts that pretty much "do it all" (and which rely on yum,
mostly, to do the actual node system build).

  d) Follow the rest of the instructions for exporting the directory and
remote booting the nodes.  Most of it is done via fairly simple config
files.

> I keep seeing people advising against the NFS root option and advocating 
> ram disk images. Opinions here? Where can I get ram disk images? I would 
> be nice to download a basic complete ram disk image, that boots with 
> root rsh working already.

Somebody will probably tell you where to get and how to install one,
although google is your friend as well.  However, if you're concerned
about having enough memory and swapping... the thing about NFS root is
that it only reads most of the files it needs (and ONLY what it needs)
one time, to get an image into memory.  All the stuff in the NFS root
that it doesn't need doesn't get loaded, and anything that is loaded but
merely cached CAN be overwritten by applications and the next time you
need it the system can go back to NFS root to get it.

Otherwise, sure, a ramdisk root is probably marginally faster ONCE it is
loaded.

Most of the linux installs have a ramdisk root that they load as one
stage of the install.  So they're not terribly difficult to do and there
are probably HOWTOs on the subject and/or project sites that make it
easy. GIYF.
