[Beowulf] 512 nodes Myrinet cluster Challenges

Robert G. Brown rgb at phy.duke.edu
Fri Apr 28 16:36:22 PDT 2006

On Fri, 28 Apr 2006, David Kewley wrote:

> By the way, the idea of rolling-your-own hardware on a large cluster, and
> planning on having a small technical team, makes me shiver in horror.  If
> you go that route, you better have *lots* of experience in clusters, and
> make very good decisions about cluster components and management methods.
> If you don't, your users will suffer mightily, which means you will suffer
> mightily too.

I >>have<< lots of experience in clusters and have tried rolling my own
nodes for a variety of small and medium sized clusters. Let me clarify.
For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
feeling masochistic and inclined to heartache:

   Do NOT roll your own nodes.

Or you will have a really high probability of being very, very sorry.

16 node clusters I've done "ok" with, in the sense that the problems
were manageable.  >32 node clusters, especially if you encounter ANY
ex post facto problems with the hardware configuration -- including ones
that passed through your original prototyping runs (and yeah, they
exist) -- rapidly descend into circle of hell type experiences.
Expensive ones.  Much more expensive in real money, let alone time, than
just buying nodes from a quality vendor with a 3-4 year onsite service
contract, so that if they break the vendor will come fix them (but they
don't break -- see the word "quality" above :-).

Other than thinking that "shiver in horror" is somehow inadequate to
describe the potential for misery, I endorse pretty much everything else
David (and Mark) said -- both these guys know whereof they speak.


