[Beowulf] RE: Approach For Diagnosing Heat Related Failure?

David Mathog mathog at caltech.edu
Tue Jul 21 14:02:41 PDT 2009

 Jon Forrest <jlforrest at berkeley.edu> wrote:

> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All the other nodes are working fine
> in the computer room.

Presumably you have already blown the dust out of it and reseated all
the obvious suspect components.

If the motherboard has a "shutdown on overheat" option, it may now be set
to a value low enough that it stops the machine in the warmer room.  If
you didn't explicitly set it to that value then suspect the motherboard
battery - change it, reset the BIOS, and all should be well.  If the
BIOS has a hardware status monitor, check there too for out of range
temperatures.
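
If the node stays up long enough in the cooler office, the same readings
can usually be pulled from a running Linux system and compared against a
healthy node.  What follows is only a rough sketch, not something tested
on this hardware.  It assumes Python 3, a kernel with hwmon sensor
drivers loaded, and temperature files at
/sys/class/hwmon/hwmon*/temp*_input (some kernels put them one level
down, under device/).  The function names and the 70 C limit are made up
for illustration; use whatever the board vendor specifies.

```python
import glob
import os


def parse_millidegrees(raw):
    """Convert a sysfs temperature string (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor_file: degrees_C} for each readable temp*_input file.

    The directory layout is an assumption; adjust the pattern if the
    sensor files live under hwmon*/device/ on your kernel.
    """
    temps = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = parse_millidegrees(f.read())
        except (OSError, ValueError):
            # Sensor missing, unreadable, or returning junk: skip it.
            continue
    return temps


def out_of_range(temps, limit_c=70.0):
    """Return only the readings above limit_c (an illustrative limit)."""
    return {name: t for name, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    readings = read_hwmon_temps()
    if not readings:
        print("no hwmon temperature sensors found")
    for name, t in readings.items():
        print(f"{name}: {t:.1f} C")
    for name, t in out_of_range(readings).items():
        print(f"WARNING: {name} is at {t:.1f} C")
```

Run it on the suspect node and on one of its working neighbors; a sensor
that reads much higher on the bad node points at a fan, heatsink, or
airflow problem rather than a BIOS setting.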

(Odd that your machine room is much hotter than your office.)


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

