[Beowulf] What services do you run on your cluster nodes?

Robert G. Brown rgb at phy.duke.edu
Tue Sep 23 04:09:03 PDT 2008

On Tue, 23 Sep 2008, Ashley Pittman wrote:

> On Mon, 2008-09-22 at 20:54 -0400, Perry E. Metzger wrote:
>> By the way, if you really can't afford for things to "go away" for
>> 1/250th of a second very often, I have horrible news for you: NO
> To a large extent you are actually correct, this is one of the reasons
> why building "large" clusters is hard.  If you just plug them together
> and install $mpi the performance will suck for this very reason.
> The thing to remember is "noise" or "jitter" probably doesn't affect
> many people; its effects become noticeable as job size (not cluster
> size) increases, and the effects are non-linear.  You can probably run a
> 128 node cluster and not notice it at all, but beyond this scale
> time spent turning things off is time well spent.  In fact in my
> experience noise becomes significant not just with size but as the
> square of size.

Note also (agreeing with you, and Greg, and H on) that it is "large
clusters running certain kinds of parallel code".  In particular
synchronous code with lots of communications barriers.

The point is that if you are running a parallel computation with a
barrier (a point where all computation ceases on ALL nodes until a
communication process completes that itself involves all nodes) then
blocking a single node while it handles a random interrupt(ion)
effectively blocks the entire computation.
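A toy simulation (mine, not anything from the thread) makes the amplification concrete: at a barrier the iteration time is the MAX over all nodes, so the chance that *some* node eats a random interrupt grows with node count even though each node's chance stays fixed.

```python
import random

random.seed(1)

WORK = 1.0       # compute time per iteration (arbitrary units)
JITTER_P = 0.01  # chance a given node takes a random interrupt this iteration
JITTER_T = 0.5   # cost of servicing that interrupt

def barrier_time(nodes, iters=1000):
    """Average per-iteration time when every node must reach the barrier."""
    total = 0.0
    for _ in range(iters):
        # the slowest node sets the pace for everyone
        total += max(WORK + (JITTER_T if random.random() < JITTER_P else 0.0)
                     for _ in range(nodes))
    return total / iters

for n in (1, 16, 128, 1024):
    print(n, round(barrier_time(n), 3))
```

With a 1% per-node hit rate, a single node almost never stalls, but at 128 nodes roughly three iterations in four are delayed by somebody's interrupt, which is the nonlinear growth Ashley describes.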

Part of the problem is the "random" bit.  Computers have to deal with a
lot of asynchronous stuff as this or that wakes up and needs attention.
The timer interrupt, network interrupts, disk interrupts, and yeah,
BIOSen clamoring for attention whenever THEY want to be serviced.

Some of these things are "elective" in the sense that the kernel has
some say on when they get scheduled, and in a large parallel computation
the kernel states CAN self-organize into a partially synchronous state
of operation.  By synchronous I mean that if one could arrange it so
that all the nodes were clock synchronized to microseconds and could
agree to spend the SAME fraction of every second doing ALL housekeeping,
even a very large cluster would get the use of that whole second less
the fraction required by housekeeping less a bit of e.g. context switch
latency and other overhead.

If all nodes are running a synchronous parallel task that runs from the
barrier in about the same amount of time doing about the same amount of
work with the same systems calls in the same order, then everything
"meshes", and if they all enter a wait state (give back the CPU) at
about the same time, the kernel will usually wake up and do its
cumulated lower-half handler work and so on synchronously as well and
life is good.  But randomness -- asynchronicity -- then is the devil, as
it is nonlinearly amplified as soon as it lifts out of the noise in a
truly large synchronous computation.
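The payoff of the "same fraction of every second" idea above can be sketched numerically (a hypothetical model of my own, not anything measured): give every node an identical periodic chore, and compare scheduling it in the same iteration on all nodes versus at a random time on each node.

```python
import random

random.seed(2)

NODES, WORK = 256, 1.0
CHORE, PERIOD = 0.5, 10   # each node runs a 0.5-cost chore once per 10 iterations

def avg_iter(aligned, iters=2000):
    """Average barrier-to-barrier time for aligned vs. random housekeeping."""
    total = 0.0
    for i in range(iters):
        if aligned:
            # every node does its chore in the same jiffy-window
            hit = (i % PERIOD == 0)
        else:
            # each node picks its own moment; the barrier waits if ANY
            # node happens to be doing its chore this iteration
            hit = any(random.random() < 1 / PERIOD for _ in range(NODES))
        total += WORK + (CHORE if hit else 0.0)
    return total / iters

print("aligned  :", round(avg_iter(True), 3))
print("unaligned:", round(avg_iter(False), 3))
```

Aligned, the cluster pays the chore's 5% average overhead and no more; unaligned, with 256 nodes virtually every iteration has someone housekeeping, and the whole machine pays nearly the full chore cost every time.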

Embarrassingly parallel applications, on the other hand, don't care.
Even real parallel applications that are coarse grained and aren't
particularly synchronous don't care "much" and will scale way out there
even with a modest amount of background asynchronous noise.

So Don B., does Scyld "solve" this problem by running the node kernels
synchronously, so they all use the same jiffy-window to do housekeeping
(obviously it at least partially solves it by not running anything
unnecessary in terms of housekeeping, but "anybody" can do that:-)?

> Note that it's not just the "OS" fluff which causes problems, and turning
> things off doesn't get you anything like 100% of the way there; some
> daemons have to run (your job scheduler) so all you can do is tune them
> or re-code them to use less CPU, and some kernel versions are pretty bad:
> one version of Red Hat was effectively unusable on clusters because of
> kscand.

The problem actually predates clusters -- it is a very old problem with
LANs in general, especially TCP/IP/ethernet LANs.  Back in the old days
of vampire taps and coaxial cables, all network traffic ran on either a
daisy chain ethernet or through a "hub".  Every packet from every
machine was seen by every other machine on the LAN segment.  Although
the networks were typically segmented at the ethernet level, they were
often not segmented at the TCP/IP level, as routers were expensive
and difficult to manage.  The network interface of one's machine was
therefore effectively on a large flat TCP/IP network, one that could
literally span several cities or states.  In a few cases (e.g. DECnet)
it could span the nation!

This meant that there could be hundreds or even thousands of machines
that saw every packet produced by every other machine on the LAN,
possibly after a few ethernet bridge hops.  This made conditions ripe
for what used to be called a "packet storm" (a term that has been
subverted as the name and trademark of a company, I see, but alas there
is no wikipedia article on same and even googled definitions seem
scarce, so it is apparently on its way to being a forgotten concept).

A packet storm occurred because the kernel had to examine every packet
that came in on its network interface to decide whether or not it was
for it.  Back in those days, computers were slow and a significant
fraction of the total CPU power of your computer could easily be spent
handling packets intended for the other systems on your LAN if the LAN
was heavily loaded.  This created a positive feedback loop in delay --
if one passed a critical threshold in network load, systems would try to
send out packets, find the line busy, and enter the ethernet "random
delay" phase.  In the meantime, the kernel would have new packets
appearing that it had to deal with -- the line was busy.  In the
meantime, more packets would be backing up.  Very quickly the shared
line -- which was designed to be efficient for BURSTY short messages,
not saturation -- would wedge, just like a freeway in rush hour, and
messages would start taking seconds instead of milliseconds to deliver.

The CPU load managing the collisions, retransmissions, and examinations
of saturation-load packet traffic would skyrocket.  System performance
would plummet even on systems not really using the network at the time.
NFS delays, etc, would cause user processes to hang.  Basically an
entire LAN could effectively "crash", although if there weren't timing
problems adding to the collision problem (there usually were, because
nobody liked the restriction on the radius of the wired LAN and daisy
chain networks didn't get retimed except at expensive bridges) in
principle it COULD sort itself all out eventually.
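That threshold behavior is easy to caricature in a few lines (my toy model of a shared medium, not anything from the original thread): one packet per tick of capacity, and a backlog that makes collisions likelier, which wastes ticks, which deepens the backlog.

```python
import random

random.seed(3)

def simulate(offered, hosts=50, ticks=20000):
    """Toy shared-ethernet model: returns (final backlog, delivered/tick).

    Each tick, hosts generate packets at rate `offered` total; the line
    delivers one packet per tick, but only if the slot isn't wasted by a
    collision, and collisions get likelier as more hosts are backlogged.
    """
    queue = 0        # packets waiting for the shared line
    delivered = 0
    for _ in range(ticks):
        queue += sum(random.random() < offered / hosts for _ in range(hosts))
        if queue > 0:
            contenders = min(queue, hosts)
            # chance the slot survives collision shrinks with the backlog:
            # this is the positive feedback loop
            if random.random() < (1 - 1 / hosts) ** (contenders - 1):
                queue -= 1
                delivered += 1
    return queue, delivered / ticks

for load in (0.3, 0.6, 0.9):
    backlog, thru = simulate(load)
    print(load, backlog, round(thru, 2))
```

Below the critical load the backlog stays bounded; past it, wasted slots and retries feed each other and the queue grows without limit -- the freeway-at-rush-hour wedge described above.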

The moral of the story being that everything consisting of many parts
that must move together has scaling limits of sorts, and when it leaves
the regime of linear slow-growth scaling and enters the region where
nonlinear amplification of random noise occurs, bad things will surely
happen.  Truthfully, just suppressing sources of randomness won't do it
when one enters the REALLY nonlinear region, as ANY fluctuation can
precipitate a catastrophe (which can be viewed as a dynamic phase
transition from an ordered to a disordered state).  But even before one
gets there, there is a region where fluctuations are amplified with a
gradually diverging lifetime -- they don't kick one into a completely
dysfunctional state but they can feed back and disorder the extended
system far longer than their natural lifetime.

Interesting stuff, actually.


> Ashley Pittman.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
