[Beowulf] Approach For Diagnosing Heat Related Failure?

Joshua mora acosta joshua_mora at usa.net
Tue Jul 21 15:42:30 PDT 2009

You can run HPL bound to a specific socket, also maxing out the memory
attached to that socket, to try to force a shutdown: with insufficient
cooling the socket should hit the "hardware thermal control" limit.
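
A rough sketch of the pinning step, in case it helps. It assumes Linux, and
it assumes one NUMA node per socket (not always true). The script name and
the ./xhpl path in the usage comment are hypothetical. It only sets CPU
affinity; to bind memory as well, something like
"numactl --cpunodebind=0 --membind=0 ./xhpl" is the usual route.

```python
import os
import subprocess
import sys


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node, sysfs_root="/sys/devices/system/node"):
    """Return the CPU ids the kernel lists for one NUMA node."""
    path = os.path.join(sysfs_root, "node%d" % node, "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())


def run_pinned(cmd, cpus):
    """Pin this process to the given CPUs, then run cmd (children inherit the mask)."""
    os.sched_setaffinity(0, cpus)
    return subprocess.call(cmd)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python pin_socket.py 0 ./xhpl
    node = int(sys.argv[1])
    sys.exit(run_pinned(sys.argv[2:], cpus_of_node(node)))
```

Run it once per socket and see which one takes the node down.
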
The BIOS may also offer HW monitoring that reports fan speeds, so you can
perhaps spot a difference in RPMs between fans.
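
If the board exposes its sensors to Linux, the same comparison can be done
from the OS. A minimal sketch, assuming the fans show up as fan*_input files
under /sys/class/hwmon (that depends on the board and on the right driver
being loaded). The 25% tolerance is an arbitrary assumption, and a chassis
with mixed fan types will need a smarter comparison than "below the median".

```python
import glob
import os
from statistics import median


def read_fans(hwmon_root="/sys/class/hwmon"):
    """Return {sensor path: RPM} for every fan*_input file under hwmon_root."""
    fans = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "fan*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                fans[path] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor present but unreadable; skip it
    return fans


def slow_fans(fans, tolerance=0.25):
    """Return the fans whose RPM is more than `tolerance` below the median RPM."""
    if not fans:
        return {}
    mid = median(fans.values())
    return {p: rpm for p, rpm in fans.items() if rpm < mid * (1 - tolerance)}


if __name__ == "__main__":
    readings = read_fans()
    for path, rpm in readings.items():
        print(path, rpm)
    for path, rpm in slow_fans(readings).items():
        print("SLOW:", path, rpm)
```

Running it on a good node and on the failing node and comparing the two
outputs is probably more telling than either one alone.
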
You can also force the system to run at "low power" rather than "dynamic" or
"performance", rerun the tests, and see if they pass.

Good luck!


------ Original Message ------
Received: 04:41 PM CDT, 07/21/2009
From: Billy Crook <billycrook at gmail.com>
To: Bill Broadley <bill at cse.ucdavis.edu>
Cc: Beowulf Mailing List <beowulf at beowulf.org>
Subject: Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

> On Tue, Jul 21, 2009 at 15:42, Bill Broadley<bill at cse.ucdavis.edu> wrote:
> >
> > I'd suggest doing a visual inspection.  Make sure all fans are not blocked
> > by cables and are spinning.  If that looks normal pull the CPU heat sinks
> > and make sure they have good coverage with the heat sink goo, but not so
> > much that it leaks over the edge of the chip.  When you put the heat sink
> > back on make sure the heat sink mount works as intended, especially on the
> > (mostly intel?) post system where an unclicked post can result in uneven
> > heat sink
> >
> > Be careful, fans moving != spinning.  I've seen some that just vibrate
> > enough to look like they are spinning at a casual glance and are actually
> > not moving much air and are contributing a fair bit of heat to the system
> > (i.e. very hot to the touch).
> Use the thin end of a zip tie to slowly interrupt and stop each fan
> while it is spinning.  The pitch of the sound it makes will make a
> (very) rough comparison of the RPM, even in a noisy room.  It will be
> obvious if it's turning normally or not.  You might find one blowing
> backwards.  Don't forget about double-rotor fans.
> > If that looks normal then I'd start swapping parts till you find the heat
> > sensitive one.
> He might swap his desk with that overheating node to help balance out
> the heat load...
> Or use something more intense than Memtest in your office.  Try ACT
> Breakin.  Once it's booted all the way, a machine with a heatsink ajar
> is usually powered off from thermal protection in < 5 seconds.  Even
> in an ice cold room.
> Try swapping its power supply with another node that doesn't power off.
> P.S.  And please do not spray liquid spray air upside down at hot
> computer components.
