[Beowulf] no lm_sensors, slow system, was: Remote console management

David Mathog mathog at mendel.bio.caltech.edu
Mon Sep 26 08:15:09 PDT 2005


Stuart Midgley <sdm900 at gmail.com> wrote:

> 
> Unfortunately, lm_sensors does not work with our motherboards,

<SNIP>

> We pull them out of the  
> cluster and run hardware diagnostics and discover that a fan or  
> something has died and that the cpu is running hot... and has  
> consequently slowed down... resulting in longer run times for user  
> jobs...

I just spent a day trying to figure out why upgrading some
ASUS A7V266E based workstations, from Mandrake 10.0 to 10.2
(aka 2005LE), caused them to run 5X slower.  It turned out that:

A.  lm_sensors had a change between 2.6.x kernel versions
that eliminated the need for a /2 in its config file, resulting
in a CPU temp reading of 105C.

B.  The /2 actually takes place inside the monitor chip, so the
monitor chip "thinks" that the system is at 105C.

C.  The BIOS had an option for controlling overheating detected by
the monitor chip that could be set to either "throttle" or
"shutdown", and it was set for the former.

D.  When the CPU was throttled, /proc still reported the CPU
MHz at full speed, even though throttling effectively reduced
the clock rate by 5x.
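
Since the reported MHz cannot be trusted in this situation (point D),
one way to catch a throttled node is to time a fixed CPU-bound workload
and compare it against a baseline taken on a known-good node of the same
hardware.  The following is only a rough sketch of that idea, not
something I ran on these machines: the function names, the loop size and
the 2x slowdown limit are all assumptions chosen for illustration.

```python
import time


def reported_mhz(cpuinfo_text):
    """Return the 'cpu MHz' values found in /proc/cpuinfo-style text.

    Note: as described above, these values may still show full speed
    on a throttled CPU, so they are only useful for comparison.
    """
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "cpu mhz":
            values.append(float(value))
    return values


def time_fixed_workload(iterations=2_000_000):
    """Time a fixed CPU-bound loop and return the elapsed seconds."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return time.perf_counter() - start


def looks_throttled(elapsed, baseline, slowdown_limit=2.0):
    """True if the run took more than slowdown_limit times the baseline.

    baseline is assumed to come from a healthy node of the same model;
    slowdown_limit is an arbitrary illustrative threshold.
    """
    return elapsed > baseline * slowdown_limit
```

In use, one would record time_fixed_workload() once on a healthy node,
store that number as the baseline, and flag any node where
looks_throttled(time_fixed_workload(), baseline) comes back True.  A 5x
slowdown like the one described here would stand out clearly against
normal timing noise.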

Which is a long-winded way of saying that you should check your
BIOS and see if you have the equivalent of "throttle" set.  If so,
for cluster work you'd be better served by "shutdown".  It's
a lot less mysterious when a node just shuts down (indicating right
up front that a hardware failure is present) than when things start
running really, really slowly, for no apparent reason.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


