[Beowulf] Why Do Clusters Suck?

Tue Mar 22 16:18:31 PST 2005

On Wed, 23 Mar 2005, Stuart Midgley wrote:
> On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> > So why do clusters suck?
>  From my position, this issue is really complex.  In the Australian 
> scene, the main reason "clusters suck" has nothing to do with distros, 
> hardware or associated software.  It is more an issue with support 
> staff...
...
> Clusters, by their nature and design, are not simple beasts.

Just like "Math is hard", "Computers are hard".  But there are many things 
that can be done to make clusters barely more difficult to use than single
computers.

If you are already used to running a cluster, you may not realize all 
of the extra complexity that you have introduced.  This is especially true 
when you write ad hoc programs and scripts.  When they work, everything 
looks fine.  But do they work in any other environment, or after anything 
changes?  And what happens when they break?

> When everything is running well, you can manage them with almost no staff.  
> However, when something goes wrong the diagnostic/resolution cycle can 
> be long and very complex.

Yup.  It's not how easy it looks when things go right, but how complex the 
system is when things go wrong.  That's a corollary to "an abstraction 
layer is worse than useless when you have to look underneath".

It's important to have diagnosable, documented tools and a system that is 
as simple as possible.

> How to make clusters less sucky?  Well, for a large cluster 
> users/system administrators, decent training would be a good start.
> Training which takes people through the process of building, 
> installing, breaking and fixing a cluster.  Of course, then there is 
> the MPI/application side of things which would be another course.  Try 
> to wrap 10years worth of system/computational experience up into a 5 
> days course ;)

I'm the instructor for many of our introductory training courses.  That is
one of my motivations for making our system as simple as possible.  
Sometimes it's faster to write the code to avoid an exception to a general 
rule than to figure out how to explain it.

A good example is handling heterogeneous hardware.  I don't mean mixing 
Alphas with Itaniums with Opterons.  I mean the gritty, everyday kind of 
minor system differences.  Similar-looking systems with different 
Ethernet adapters.  A mix of diskless and disk-based systems.  Different 
versions of PXE.  Toss in a few dual processor machines with one CPU 
removed, a mix of memory sizes, and that flaky disk that you can't quite 
admit is broken.

Each of these differences can potentially be handled automatically.  If 
you do a full install, the installer might handle the differences and you 
might not even notice them... until you consider long-term administration.  
What happens when you do an update?  How do you recover when a system disk 
goes bad?  A cluster need not be a collection of workstation environments, 
and treating it like one adds more complexity than someone with a lot of 
experience might initially perceive.
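
To make the "gritty, everyday" differences concrete, here is a minimal 
Python sketch that groups nodes by a small hardware profile and lists the 
ones that differ from the majority.  Everything in it is an assumption made 
for illustration: the node names, the driver names, the fields in NODES, 
and the idea that you already have such an inventory collected somewhere.  
It is not how any particular cluster distribution does it.

```python
from collections import defaultdict

# Hypothetical inventory; in practice this would be gathered from the nodes.
NODES = {
    "n01": {"nic": "e1000", "cpus": 2, "mem_gb": 4, "has_disk": True},
    "n02": {"nic": "e1000", "cpus": 2, "mem_gb": 4, "has_disk": True},
    "n03": {"nic": "e1000", "cpus": 2, "mem_gb": 4, "has_disk": True},
    "n04": {"nic": "tg3",   "cpus": 2, "mem_gb": 4, "has_disk": True},
    "n05": {"nic": "e1000", "cpus": 1, "mem_gb": 2, "has_disk": False},
}

def group_by_profile(nodes):
    """Map each distinct hardware profile to the list of nodes that have it."""
    groups = defaultdict(list)
    for name, facts in nodes.items():
        profile = (facts["nic"], facts["cpus"], facts["mem_gb"], facts["has_disk"])
        groups[profile].append(name)
    return dict(groups)

def outliers(groups):
    """Return, sorted, the nodes whose profile is not the most common one."""
    majority = max(groups, key=lambda p: len(groups[p]))
    return sorted(n for p, names in groups.items() if p != majority for n in names)

if __name__ == "__main__":
    groups = group_by_profile(NODES)
    for profile, names in groups.items():
        print(profile, names)
    print("differs from majority:", outliers(groups))
```

With the made-up inventory above, n04 (different Ethernet adapter) and n05 
(one CPU, less memory, diskless) are the nodes reported as different.  The 
point is only that the differences are easy to enumerate; deciding what to 
do about each one at update or recovery time is the hard part.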