How do you keep clusters running....

Robert G. Brown rgb at
Wed Apr 3 15:27:31 PST 2002

On Wed, 3 Apr 2002, Cris Rhea wrote:

> Comments? Thoughts? Ideas?

 a) Use onboard sensors (hoping your motherboards have them) to shut
nodes down if the CPU temp exceeds an alarm threshold.  That way future
fan failures shouldn't cause system failure, just node shutdown.

 b) Use the largest cases you can manage given your space requirements.
Larger cases have a bit more thermal ballast and can tolerate poor
cooling for a bit longer before catastrophically failing.  That gives
you (or your monitoring software) more time to react, if nothing else.

 c) With only ten boxes, it sounds like you're having plain old bad
luck, possibly caused by a bad batch of fans.  Relax, perhaps your luck
will improve;-)
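
To make point a) concrete, here is a minimal sketch of a temperature
watchdog you could run from cron on each node.  It assumes (this is my
assumption, not something from the post above) a Linux box exposing
sensor readings through the hwmon sysfs interface in millidegrees C,
and a 70 C alarm threshold; real paths, sensor names, and safe limits
vary by motherboard and sensor chip, so treat this as a starting point
rather than a drop-in script.

```python
#!/usr/bin/env python3
"""Sketch of a node temperature watchdog (see point a above).

Assumptions: CPU temperatures appear under
/sys/class/hwmon/hwmon*/temp*_input in millidegrees C (standard Linux
hwmon layout), and the node can be halted with `shutdown -h now`.
Both are site-specific; adjust for your hardware.
"""
import glob
import subprocess

ALARM_MILLIDEG_C = 70_000  # 70 C -- pick per your CPU's rated limit


def read_cpu_temps(pattern="/sys/class/hwmon/hwmon*/temp*_input"):
    """Return all readable hwmon temperatures, in millidegrees C."""
    temps = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()))
        except (OSError, ValueError):
            pass  # sensor missing or unreadable; skip it
    return temps


def should_shut_down(temps, alarm=ALARM_MILLIDEG_C):
    """True if any sensor exceeds the alarm threshold."""
    return any(t > alarm for t in temps)


def main():
    # Halt the node before heat does real damage; a fan failure then
    # costs you a node shutdown, not a cooked CPU.
    if should_shut_down(read_cpu_temps()):
        subprocess.run(["shutdown", "-h", "now"])
```

Run it every minute or so from root's crontab (or let your cluster
monitoring system do the equivalent); many BIOSes can also be set to
cut power at a critical temperature as a backstop.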

With all that said, it is still true that maintenance problems scale
poorly with the number of nodes.  This is one reason (of many) that I
prefer not to get nodes from vendors in another state whom I never meet
face to face.  If your nodes are built by a local vendor (especially one
with a decent local parts inventory and service department) then it is a
bit easier to get good turnaround on node repairs and minimize downtime,
especially since a local business rapidly learns that keeping you happy
matters more to their bottom line than making the next twenty or thirty
customers who might walk through their door happy.

There is also the usual tradeoff between buying "insurance" (e.g.
onsite, 24 hour service contracts) on everything and number of nodes.
There are plenty of companies that will sell you nodes and guarantee
minimal downtime -- for a price.  IBM and Dell come to mind, although
there are many more.  Only you can determine how mission critical it is
to keep your nodes up and what the cost benefit tradeoffs are between
buying fewer nodes (but getting better quality nodes and arranging
guarantees of minimal downtime) or buying more nodes (but risking having
a node or two down pending repairs from time to time).

Cost-benefit analysis is at the heart of beowulf engineering, but you
have to determine the "values" that enter into the analysis based on
your local needs.


> Thanks-
> --- Cris
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587

Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at
