thermal kill switch
Robert G. Brown
rgb at phy.duke.edu
Wed Oct 23 02:03:21 PDT 2002
On Tue, 22 Oct 2002, Andre Lehovich wrote:
> We had the air-conditioning fail yesterday. Caught it in
> time to shut down by hand, but we won't be so lucky next
> time. RGB's book recommends a thermal kill switch, but
> doesn't give details on implementation. One obvious idea is
> to have a daemon monitor lm-sensors and shutdown each node
> as it gets too hot. This is easy and cheap.
>
> But, is there anything better? We have not yet had the
> electric and cooling contractors refit our server room. Is
> there anything we should have them install during the
> rewiring? What are the pros/cons of a room-wide kill switch
> vs. the lm-sensors approach?
We have a room-wide kill switch set to be a "last resort". They are
remarkably difficult to find in e.g. a web search, but our architect and
electrical contractors came up with one, so they must be in electrical
component catalogs somewhere if you know where to look.
A second option is to get an electronically readable thermometer (with
one or more sensors) for the ambient room air. netbotz (netbotz.com)
sell moderately expensive (order $1K) monitoring devices that sample
room air temperature, humidity, switch state (so you can get an alarm or
take pictures when a door is opened or a motion detector detects motion)
and have a built in camera and both a web and SNMP interface for remote
monitoring. They generate "alarm" mail if e.g. temperature or sound
levels exceed a given threshold. It is a straightforward matter to hook
in a script that either polls the device and sends the nodes a poweroff
command on an alarm, or does the same in response to alarm mail.
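A minimal sketch of the polling variant, in Python. Everything specific here is an assumption for illustration: the 35 C threshold, the hostnames, passwordless root ssh to the nodes, and the read_celsius callable, which stands in for whatever actually queries your device (SNMP, HTTP, serial). It is not the NetBotz interface.

```python
import subprocess
import time
from collections import deque
from typing import Callable, Iterable, List

ALARM_CELSIUS = 35.0          # assumed threshold; pick one that suits your room
NODES = ["node01", "node02"]  # hypothetical hostnames


def over_threshold(readings: Iterable[float], limit: float, needed: int = 3) -> bool:
    """True once the last `needed` readings are all above `limit`.

    Requiring several consecutive hot readings avoids powering off the
    cluster because of a single glitchy sample.
    """
    recent = list(readings)[-needed:]
    return len(recent) == needed and all(t > limit for t in recent)


def poweroff_commands(nodes: Iterable[str]) -> List[List[str]]:
    """Build one ssh command per node (assumes root ssh keys are in place)."""
    return [["ssh", "-o", "ConnectTimeout=5", f"root@{n}", "poweroff"] for n in nodes]


def watch(read_celsius: Callable[[], float], interval_s: float = 30.0) -> None:
    """Poll until the room is too hot, then tell every node to power off."""
    history: deque = deque(maxlen=3)  # only the most recent readings matter
    while True:
        history.append(read_celsius())
        if over_threshold(history, ALARM_CELSIUS):
            break
        time.sleep(interval_s)
    for cmd in poweroff_commands(NODES):
        # check=False: keep going even if one node is already unreachable
        subprocess.run(cmd, check=False)
```

Run watch() from a machine that is not itself in the hot room, or at least one you power off last.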
If you are a DIY sort of person and don't want to pay for a netbot, you
can build the functional equivalent of a netbot out of component parts
and scripts. A PC-TV card (bttv driver) and an X10 camera will let you
watch real-time video of your cluster room in an xawtv window or serve
you images updated every second or five on a web page -- I have the
scripts and html for the latter already set up, as I have one at home.
To do temperature, you can invest in an ibutton thermochron:
http://www.ibutton.com/ibuttons/thermochron.html
or (perhaps more reasonably) in a sensorsoft thermometer, readable from
an RS232 interface for around $100. Or build your own serial port
readable thermometer for around $35 if you are a real DIY fanatic and
have a 5V power supply handy. Again, scripts to read and act are
necessary; some are already posted on the web. I imagine that one could
set up sound alarms with an ordinary microphone and sound card although
I've never tried it. In our server room we'd be checking to make sure
that the sound level stays HIGH, as the AC is in the room so ambient
noise is like working right behind a jet engine during takeoff. We'd
want an alarm to be triggered if that lovely sound ever went OFF.
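An untested-in-the-field sketch of the "alarm when the noise stops" idea, in Python. It assumes you already have a few seconds of raw signed 16-bit mono audio in native byte order; the capture step is left out, and the floor value is something you would have to calibrate against your own room.

```python
import array
import math


def rms_level(pcm_bytes: bytes) -> float:
    """Root-mean-square amplitude of signed 16-bit native-endian PCM samples."""
    samples = array.array("h")
    # frombytes() needs a whole number of 2-byte samples, so drop a trailing odd byte
    samples.frombytes(pcm_bytes[: len(pcm_bytes) // 2 * 2])
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def room_went_quiet(pcm_bytes: bytes, floor: float) -> bool:
    """True if the captured audio is quieter than `floor`, i.e. the AC may be off."""
    return rms_level(pcm_bytes) < floor
```

On a Linux box with ALSA, something like "arecord -f S16_LE -d 5 -t raw" should produce suitable input on a little-endian machine. Record the room with the AC running first and set the floor well below that level.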
lmsensors is the final option, but it has some flaws. For one thing, it
monitors temperatures inside individual systems, not ambient room
temperatures. Not all systems/chips are well supported. The lmsensors
kernel module was designed by individuals who have never heard of the
term "API" (as in, you'll need custom code to glean results for EACH
CHIP AND CONFIGURATION, as the drivers don't digest raw output at all --
you might as well plan to become expert in the particular chip(s) your
systems have in order to monitor them). Some silly motherboards (the pile
of Tyan dual AMDs we own comes to mind) have insane BIOSes that
require(d) one to hand-enable onboard sensors at the beginning of EACH
BOOT in order to have them functioning and accessible to lmsensors.
In summary, lmsensors is great if it works for you, primarily to
protect individual systems, but not so great for protecting the entire
room.
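For the per-node case, a minimal Python sketch of the check such a daemon would make. It assumes a kernel that exposes sensor readings through the hwmon sysfs tree (files named temp*_input holding millidegrees Celsius), which is a different interface from the one complained about above. The 70 C limit is an arbitrary placeholder, and what counts as "too hot" depends on which sensor you are reading.

```python
import glob
import os
from typing import List


def read_hwmon_celsius(base: str = "/sys/class/hwmon") -> List[float]:
    """Return every readable temperature under `base`, in degrees Celsius."""
    temps = []
    for path in glob.glob(os.path.join(base, "hwmon*", "temp*_input")):
        try:
            with open(path) as fh:
                # sysfs reports millidegrees Celsius as a plain integer
                temps.append(int(fh.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor missing, unreadable, or returning junk: skip it
    return temps


def node_too_hot(temps: List[float], limit_c: float = 70.0) -> bool:
    """True if any sensor is at or above the limit; False if nothing was read."""
    return bool(temps) and max(temps) >= limit_c
```

A daemon would call these every minute or so and run a local poweroff when node_too_hot() returns True. Note that an empty reading is treated as "not hot", so a node with unsupported sensors is simply unprotected.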
This gives you a pretty wide range of ways to protect and monitor your
cluster/server room, at a wide range of prices -- "free" (if it works)
for lmsensors, a few $100 for DIY or over-the-counter thermal sensors
and video, order of $1000 to get serious integrated monitors that are
almost plug-n-play with a minimal amount of your time and effort
(netbotz are network appliances so they literally plug in, snap onto
your network, get IP from DHCP and can be configured and monitored from
a serial interface or over the network -- a bit windows-centric in
supplied configuration tools as usual, but one CAN get by with minicom).
HTH
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu