[Beowulf] Why Do Clusters Suck?
Donald Becker
becker at scyld.com
Tue Mar 22 16:18:31 PST 2005
On Wed, 23 Mar 2005, Stuart Midgley wrote:
> On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> > So why do clusters suck?
> From my position, this issue is really complex. In the Australian
> scene, the main reason "clusters suck" has nothing to do with distros,
> hardware or associated software. It is more an issue with support
> staff...
...
> Clusters, by their nature and design, are not simple beasts.
Just like "Math is hard", "Computers are hard". But there are many things
that can be done to make clusters barely more difficult to use than single
computers.
If you are already used to running a cluster, you may not realize all
of the extra complexity that you have introduced. This is especially true
when you write ad hoc programs and scripts. When they work, everything
looks fine. But do they work in any other environment, or if anything
changes, and what happens when they break?
> When everything is running well, you can manage them with almost no staff.
> However, when something goes wrong the diagnostic/resolution cycle can
> be long and very complex.
Yup. It's not how easy it looks when things go right, but how complex the
system is when things go wrong. That's a corollary to "an abstraction
layer is worse than useless when you have to look underneath".
It's important to have diagnosable, documented tools and a system that is
as simple as possible.
> How to make clusters less sucky? Well, for a large cluster
> users/system administrators, decent training would be a good start.
> Training which takes people through the process of building,
> installing, breaking and fixing a cluster. Of course, then there is
> the MPI/application side of things which would be another course. Try
> to wrap 10years worth of system/computational experience up into a 5
> days course ;)
I'm the instructor for many of our introductory training courses. That is
one of my motivations to make our system as simple as possible. Sometimes it's
faster to write the code to avoid an exception to a general rule than to
figure out how to explain it.
A good example is handling heterogeneous hardware. I don't mean mixing
Alphas with Itaniums with Opterons. I mean the gritty, everyday kind of
minor system differences. Similar looking systems with different
Ethernet adapters. A mix of diskless and disk-based systems. Different
versions of PXE. Toss in a few dual processor machines with one CPU
removed, a mix of memory sizes, and that flaky disk that you can't quite
admit is broken.
Each of these differences can potentially be handled automatically. If
you do a full install, the installer might handle the differences and you
might not even notice them... until you consider long-term administration.
What happens when you do an update? How do you recover when a system disk
goes bad? A cluster need not be a collection of workstation environments,
and treating it like one adds more complexity than someone with a lot of
experience might initially perceive.
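As a rough illustration of the "gritty, everyday" differences described
above, here is a minimal Python sketch that groups nodes by a small hardware
profile and reports the ones that differ from the majority. Everything in it
is an assumption made for illustration: the attribute names (nic, cpus,
mem_gb, diskless), the node names, and the inventory values are hypothetical,
and a real cluster would collect this data from the nodes themselves rather
than from a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical attribute names; choose whatever your inventory records.
PROFILE_KEYS = ("nic", "cpus", "mem_gb", "diskless")


def group_by_profile(inventory, keys=PROFILE_KEYS):
    """Map each distinct hardware profile (a tuple) to the nodes that have it."""
    groups = defaultdict(list)
    for node, attrs in sorted(inventory.items()):
        profile = tuple(attrs.get(k) for k in keys)
        groups[profile].append(node)
    return dict(groups)


def odd_nodes(inventory, keys=PROFILE_KEYS):
    """Return the nodes whose profile differs from the most common profile."""
    groups = group_by_profile(inventory, keys)
    majority = max(groups, key=lambda p: len(groups[p]))
    return sorted(
        node
        for profile, nodes in groups.items()
        if profile != majority
        for node in nodes
    )


# Made-up example data: two identical nodes, one with a different Ethernet
# adapter, and one disk-based node with a CPU removed and less memory.
inventory = {
    "n0": {"nic": "e1000", "cpus": 2, "mem_gb": 4, "diskless": True},
    "n1": {"nic": "e1000", "cpus": 2, "mem_gb": 4, "diskless": True},
    "n2": {"nic": "tg3", "cpus": 2, "mem_gb": 4, "diskless": True},
    "n3": {"nic": "e1000", "cpus": 1, "mem_gb": 2, "diskless": False},
}

if __name__ == "__main__":
    for node in odd_nodes(inventory):
        print(node, inventory[node])
```

The sketch only surfaces the differences. Deciding how each one should be
handled during an update or a disk replacement is the harder, long-term
administration question raised above.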