[Beowulf] Approach For Diagnosing Heat Related Failure?

Jon Forrest jlforrest at berkeley.edu
Tue Jul 21 11:56:23 PDT 2009

I have a rack full of identical compute
nodes. One of them has become heat sensitive.

When it's in the warm computer room, it crashes.
I can't even run memtest from the CentOS DVD
for 2 seconds before it fails. However, when this node is
in my much cooler office everything works
fine. All the other nodes are working fine
in the computer room.

I'm not convinced the problem is actually
the memory. Other than opening the node
to spray cooling liquid when it's in the warm
room, what approach would you use to figure out which
component or components are failing?

Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
jlforrest at berkeley.edu
