[Beowulf] 512 nodes Myrinet cluster Challenges
Robert G. Brown
rgb at phy.duke.edu
Fri Apr 28 16:36:22 PDT 2006
On Fri, 28 Apr 2006, David Kewley wrote:
> By the way, the idea of rolling-your-own hardware on a large cluster, and
> planning on having a small technical team, makes me shiver in horror. If
> you go that route, you better have *lots* of experience in clusters, and
> make very good decisions about cluster components and management methods.
> If you don't, your users will suffer mightily, which means you will suffer
> mightily too.
I >>have<< lots of experience in clusters and have tried rolling my own
nodes for a variety of small and medium sized clusters. Let me clarify.
For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
feeling masochistic and inclined to heartache: don't roll your own nodes.
Or you will have a really high probability of being very, very sorry.
16 node clusters I've done "ok" with, in the sense that the problems
were manageable. >32 node clusters, especially if you encounter ANY
ex post facto problems with the hardware configuration -- including ones
that passed through your original prototyping runs (and yeah, they
exist) -- rapidly descend into circle of hell type experiences.
Expensive ones. Much more expensive in real money, let alone time, than
just buying nodes from a quality vendor with a 3-4 year onsite service
contract, so that if they break the vendor will come fix them (but they
don't break -- see the word "quality" above :-).
Other than thinking that "shiver in horror" is somehow inadequate to
describe the potential for misery, I endorse pretty much everything else
David (and Mark) said -- both these guys know whereof they speak.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu