[Beowulf] Pretty High Performance Computing

Robert G. Brown rgb at phy.duke.edu
Thu Sep 25 06:32:58 PDT 2008

On Tue, 23 Sep 2008, Ellis Wilson wrote:

> I guess I don't quite understand why you disagree Prentice.  With the
> exception that middleware doesn't strive to be a classification per se,
> just a solution, it still consists of a "style of computing where you
> sacrifice absolute high performance because of issues relating to any
> combination of convenience, laziness, or lack of knowledge."

OK, now >>I'M<< confused.  According to Wikipedia (which merely
reinforced my existing impression):

   Middleware is computer software that connects software components or
   applications.

Although one CAN stretch its definition to cover things like sshd,
postfix, and basic network services, they are generally NOT considered
middleware -- middleware (in clustering) is more like MPI or PVM,
xmlsysd, ganglia, although it isn't quite that either.  The term evolved
to describe software that serves as the base for a common UI fronting
distributed services on inhomogeneous architectures, e.g. providing a
coherent user interface to webware or application servers across
architectures.  "Grid" software is usually considered middleware.

For pure clusters, PVM might be called middleware to the extent that PVM
was designed to support OS-inhomogeneous clusters with a more or less
consistent interface (although its development greatly preceded the
invention of the term) where MPI (which is much more commonly
OS-homogeneous) might not.

Vincent makes enough sense that we -- sort of -- know what he means, but
in the context of this discussion, calling standard unix services
"middleware" is probably a moderate misuse of the term.

One thing that I've noted in this discussion is that it is being
conducted in a quite polarized way.  Clusters range from PHPC in all
sizes (there Jon, I used the term:-) running coarse grained, not terribly
synchronous, down to EP applications -- with one limit being indeed a
"grid" where middleware IS likely an issue -- to large scale (in all
respects) VHPC ("Very" HPC:-) systems designed to run very large tightly
coupled codes.  Jon was basically pointing this out (as were several
others).

If you run PHPC clusters and the applications for which they are
appropriate, you couldn't care less about turning off daemons and so on.
There is still something to be said for running a reasonably tight
system and not running extraneous daemons, but this is equally true for
LAN workstation clients (and these are clusters that would and do work
fine as NOW/COW setups, or setups that permit a user to distribute
applications across parts of a "dedicated cluster" and parts of a LAN
using background time on installed workstations).  The rule for good
sysadmin in general is "if you don't need it, don't run it" but of
course if you DO need it, running it is fine up to the point where it
impacts performance and becomes a cost-benefit issue.

If you run VHPC clusters, that is by definition where running even
useful services impacts performance and becomes a cost-benefit issue.
For these clusters (which VARY in scale and the communication properties
of their hardware and code) one has to consider each desired service
separately in the context of what's going on, and given the
heterogeneity of what might be going on architecturally and in
application space, OF COURSE YMMV and no single solution is going to be
"the correct thing to do" for everybody.

This leads people who run such clusters to go in and optimize their
setups by making CBA decisions.  It is undeniably useful to have postfix
or some other mailer available so that background monitoring processes
can generate mail to the systems manager if something odd is happening,
even on a VHPC cluster.  Oddness on a node (by definition for VHPC) is
"delay" and can cause nonlinear slowdowns of the entire cluster.  If a
node chokes and nobody notices for hours, well, that's hours of lost
progress, and if lack of attention causes a node to crash, it might well
take a tightly coupled code with it and cost days of work (times the
entire cluster).

OTOH, there is a nonlinear cost associated with running postfix itself
that one has to trade off against the benefit.  You are trading the
certainty of some degradation of overall cluster performance against the
relatively low probability of avoiding a catastrophic loss of cluster
performance that is many times greater and longer in duration.
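To make that tradeoff concrete, here is a minimal back-of-the-envelope
sketch in Python.  Every number in it (overhead fraction, failure
probability, hours lost) is a hypothetical illustration, not a
measurement from any real cluster:

```python
# Back-of-the-envelope CBA sketch: a certain small overhead traded
# against a low-probability catastrophic loss.  All figures are made up.

def expected_cost(overhead_frac, p_failure, hours_lost, run_hours):
    """Expected compute-hours lost over a run of run_hours:
    the certain daemon overhead plus the expected failure loss."""
    return overhead_frac * run_hours + p_failure * hours_lost

# With the mailer: 0.5% certain overhead over a 720-hour month, but an
# odd node is reported promptly (~2 hours of work lost per incident).
with_daemon = expected_cost(0.005, 0.02, 2.0, 720)

# Without it: zero overhead, but a 2% chance that an unnoticed crash
# stalls the tightly coupled job for ~240 hours of aggregate work.
without_daemon = expected_cost(0.0, 0.02, 240.0, 720)

print(with_daemon, without_daemon)  # 3.64 vs 4.8: the daemon wins here
```

Flip any of the hypothetical numbers and the conclusion flips too, which
is exactly why no single answer fits every cluster.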

However, you're NOT AN IDIOT, you're a cluster systems engineer, linux
god, programmer extraordinaire, and you're used to having your cake and
eating it too (if necessary, making your own cake from scratch).  So you
investigate ways of mitigating the degradation while using postfix --
enforcing tight synchronicity across the cluster nodes, bringing up
postfix or sendmail "by hand" inside a script in non-daemon mode and
using it as an MTA to deliver any messages built by checkup programs in
the same script (and never using it as a daemon at all).  If that is
still too heavyweight, you cleverly think up and test alternatives --
generating your log messages and using rsync inside your script to write
them back to a tree on a head node (again, highly synchronously across
all the nodes, so you perhaps waste a second or three out of every N
seconds, N tuned to your comfort level, but at least you ensure it is
the SAME second) and then running anything you want on the head node to
gather the information and generate alerts and mail them to you.  If
rsync (with its encryption, handshaking, requisite shell and e.g. cron
script) is still too heavy -- perhaps you don't want to run cron at all
-- you look into xmlsysd to see if IT will scale well enough if the head
node polls all nodes WITHOUT the overhead of a shell script or cron, but
with an unavoidable serialization of the gathering process.  If it still
costs you too much time per N seconds of polling, you can look into
scyld/beostat and get something that approaches the practical limits of
scalability for tightly coupled codes (which is clearly what you are
running).
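The synchronized-wakeup trick above -- wasting the SAME second on every
node -- can be sketched in a few lines of Python.  The period and the
function names are hypothetical, and the sketch assumes node clocks are
already synchronized (e.g. via NTP):

```python
import time

PERIOD = 300  # hypothetical N: check in every 300 s, tuned to taste

def seconds_until_boundary(now, period=PERIOD):
    """Seconds to sleep so that the next wakeup lands exactly on a
    wall-clock multiple of `period` -- the same second on every node,
    provided the clocks agree."""
    return period - (now % period)

def next_boundary(now, period=PERIOD):
    """Wall-clock time of the next shared wakeup."""
    return now + seconds_until_boundary(now, period)

# Two nodes whose clocks agree, however far apart in their loops,
# both compute the identical next wakeup time:
print(next_boundary(1000.0))  # -> 1200.0
print(next_boundary(1137.5))  # -> 1200.0

# In the real script you would time.sleep(seconds_until_boundary(time.time()))
# and then run the rsync/checkup work inside the shared second.
```

The point of the shared boundary is that the monitoring "hiccup" hits
all ranks of a tightly coupled job simultaneously, so the job loses one
quantum per period instead of one per node per period.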

And I'm sure there are MORE alternatives to consider all the way up and
down the chain, and you are CLEVER and CAPABLE and if none of these
solutions suits you and your particular code mix, you are perfectly
capable of rolling your own, inventing something that provides the
information and services that YOU require while leaving your application
within acceptable performance bounds.

This discussion is useful -- even VERY useful, I'd say -- but it would
be good for everybody to bear in mind that we aren't just comparing
apples and oranges, but an entire universe of fruit.  The issue isn't
that "it is always bad to run useful daemons on cluster nodes"; it is
more "WHEN is it bad to run WHICH daemons on cluster nodes, what can
I do to cover the useful-to-essential services those daemons provide
when I can't just use them in the default/easy way, and how can I tell
if the daemons I'm running are in fact degrading performance?"

The latter in particular puts the discussion in just the right context,
and perhaps can help us all reduce the ad hominem that is arising as one
party says "Bananas!" and another insists (shrilly) "No, you fool!
Pineapples!", with a couple of others quietly claiming that although
rare and expensive, "Persimmons" are the only fruit worth eating and
have been shown to have essential antioxidants besides.

So let me formalize this suggestion.

   * WHEN is it bad to run at least some of the common service daemons on
   cluster nodes?  When will it not matter?

   * How can you tell if common service daemons you might have running
   are impacting performance?

   * If they are impacting performance, what can you do about it (ideally
   without losing the benefits of the services they provide)?

   * What are the COSTS of mitigating the performance degradation?  This
   can range from "the hassle of setting up and learning to use X" to
   "building your own custom kernel and toolset" to "buying solution Y
   from the following vendor who has solved the problem for you".

And don't pretend that there are no costs at any level other than the
"it doesn't matter what you do, the marginal cost of running a fat
workstation configuration on your nodes with everything turned on is
STILL a negligible fraction of one percent of total system compute
capacity" level, as is the case in the embarrassingly parallel NOW/COW
PHPC limit.  Answers to the third * above all have costs, and even
determining the answer to the second * IS a cost of sorts.
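One cheap way to attack the second * is a fixed-work-quantum noise
probe: time the same pure-CPU loop many times and look at the spread.
On a quiet node the samples cluster near the best time; daemon wakeups
show up as a fat tail of slow samples.  A minimal sketch (the threshold
and sample counts are arbitrary choices, not any kind of standard):

```python
import time
import statistics

def measure_noise(quantum_iters=200_000, samples=50, slop=1.10):
    """Time a fixed CPU work quantum repeatedly.  Returns the best
    sample, the median, and the fraction of samples more than `slop`
    times slower than the best -- a crude index of OS/daemon noise."""
    times = []
    for _ in range(samples):
        t0 = time.perf_counter()
        x = 0
        for i in range(quantum_iters):
            x += i          # fixed, purely CPU-bound work
        times.append(time.perf_counter() - t0)
    best = min(times)
    noisy = sum(1 for t in times if t > slop * best) / samples
    return best, statistics.median(times), noisy

best, median, noisy_frac = measure_noise()
print(f"best={best:.6f}s median={median:.6f}s noisy_frac={noisy_frac:.2f}")
```

Run it with the suspect daemons on and then off; a noisy fraction that
drops when a daemon is disabled is direct evidence that the daemon is
stealing cycles from your compute quantum.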


> This assumes my understanding of middleware is correct in that it is a
> package or entire system that simplifies things by being somewhat
> blackboxed and ready to go.  Anything canned like tuna is bound to
> contain too much salt.
> Ellis
> Prentice Bisbal wrote:
>> Vincent Diepeveen wrote:
>>> I'd argue we might know this already as middleware.
>> That makes absolutely no sense.
>>> Best regards from a hotel in Beijing,
>>> Vincent
>>> On Sep 23, 2008, at 10:32 PM, Jon Forrest wrote:
>>>> Given the recent discussion of whether running
>>>> multiple services and other such things affects
>>>> the running of a cluster, I'd like to propose
>>>> a new classification of computing.
>>>> I call this Pretty High Performance Computing (PHPC).
>>>> This is a style of computing where you sacrifice
>>>> absolute high performance because of issues relating
>>>> to any combination of convenience, laziness, or lack
>>>> of knowledge.
>>>> I know I've been guilty of all three but the funny
>>>> thing is that science seems to get done anyway.
>>>> There's no doubt computations would get done a little faster
>>>> if I or the scientists spent more time worrying
>>>> about microsecond latency, parallel barriers,
>>>> or XML overhead but reality always gets in the way.
>>>> In the future I hope to sin less often but it's a
>>>> growing experience. Reading this, and other, email
>>>> lists sometimes helps.
>>>> Cordially,
>>>> -- Jon Forrest
>>>> Research Computing Support
>>>> College of Chemistry
>>>> 173 Tan Hall
>>>> University of California Berkeley
>>>> Berkeley, CA
>>>> 94720-1460
>>>> 510-643-1032
>>>> jlforrest at berkeley.edu
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
