[Beowulf] Re:hardware question: building a cluster node/ student
Lombard, David N
dnlombar at ichips.intel.com
Fri Jul 27 07:44:17 PDT 2007
On Thu, Jul 26, 2007 at 08:48:35AM -0700, David Mathog wrote:
> "Nathan Moore" <ntmoore at gmail.com> wrote
>
> > Earlier this summer, the case fan on one of the machines failed, and the
> > result seems like a cooked motherboard (erratic errors with the integrated
> > NIC).
>
> There should be an automatic shutdown script running to detect
> temperature events and shut down the machine before it is damaged.
> This is what I use on some machines:
>
> ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/sensor_monitor.tar.gz
Depending on the board and kernel, ACPI will also provide these services. On
an FC4 (2.6.14) system, I had to do the following to get that to work:
    echo 90 > /proc/acpi/thermal_zone/THRM/polling_frequency
    echo 80:0:70:65:0 > /proc/acpi/thermal_zone/THRM/trip_points
The first echo caused the auto shutdown to work; the second set the values I
wanted, i.e., shutdown at 80C. Some ACPI cognoscenti said the fact that I
had to "manually enable" the polling/shutdown was an error in that version
of the kernel.
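Writes to /proc do not survive a reboot, so the two echoes would have to be
re-applied at boot, e.g. from rc.local. Below is a rough sketch of how that
might look, together with a crude user-space check for kernels where the ACPI
shutdown never fires. The function names, the ZONE/LIMIT variables, and the
"temperature:   45 C" file format are assumptions for illustration, as is the
zone name THRM on any board other than mine. I also believe the trip_points
fields are critical:hot:passive:active0:active1, but check your kernel's
documentation before relying on that.

```shell
#!/bin/sh
# Sketch only, not a tested production script.
# Assumptions: zone name THRM, temperature file format "temperature:   45 C",
# and all function/variable names below are hypothetical.
ZONE="${ZONE:-/proc/acpi/thermal_zone/THRM}"
LIMIT="${LIMIT:-80}"   # degrees C, same as the 80C shutdown point above

# Re-apply the two settings from this post; skip a file that is not writable.
apply_thermal_settings() {
    [ -w "$ZONE/polling_frequency" ] && echo 90 > "$ZONE/polling_frequency"
    [ -w "$ZONE/trip_points" ] && echo 80:0:70:65:0 > "$ZONE/trip_points"
    return 0
}

# Print the integer degrees from a line like "temperature:             45 C".
# Prints nothing if the line does not match.
parse_temp() {
    printf '%s\n' "$1" |
        sed -n 's/^temperature:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*C.*$/\1/p'
}

# Exit status 0 when reading $1 is at or above limit $2; non-zero otherwise,
# including when the reading is empty.
over_limit() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

# Read the zone once and halt the machine if it is too hot.
check_zone() {
    [ -r "$ZONE/temperature" ] || return 0
    temp=$(parse_temp "$(cat "$ZONE/temperature")")
    if over_limit "$temp" "$LIMIT"; then
        echo "thermal: ${temp}C >= ${LIMIT}C, shutting down" >&2
        /sbin/shutdown -h now
    fi
}
```

One way to use it: call apply_thermal_settings once at boot, and call
check_zone from a root cron job every minute or so. The cron check is only a
fallback; if the kernel's own polling works, it should act first.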
I discovered all this when I came home to that sickening overly-hot electronics
smell, a case *very* hot to the touch, and the CPU at 104C due to a dead CPU
fan. Happily, it took a licking and kept on ticking.
--
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.