How do you keep clusters running....

lightdee at netscape.net lightdee at netscape.net
Thu Apr 11 11:34:33 PDT 2002


Doug J Nordwall wrote:

>On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
>
>   What are folks doing about keeping hardware running on large clusters?
>    
>    Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
>    
>    Sure seems like every week or two, I notice dead fans (each RS-1200
>    has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
>     
>
>You running lm_sensors on your nodes? That's a handy tool for paying
>attention to things like that. We use ours in combination with ganglia
>and pump it to a web page and to big brother to see when a cpu might be
>getting hot, or a fan might be too slow. We actually saved a dozen
>machines that way...we have 32 4-processor racksaver boxes in a rack,
>and the rack was not designed to handle racksaver's fan system. That is
>to say, there was a solid sidewall on the rack, and it kept in heat. I
>set up lm_sensors on all the nodes (homogeneous, so configured on one and
>pushed it out to all), then pumped the data into ganglia
>(ganglia.sourceforge.net) and then to a web page. I noticed that the
>temp on a dozen of the machines was extremely high. So, I took off the
>side panel of the rack. The temp dropped by 15 C on all the nodes, and
>everything was within normal parameters again.
>
>
>    My last fan failure was a CPU fan that toasted the CPU and motherboard.
>
>
>Ya, we would have seen this on ours earlier...excellent tool

[snip]

We use Clusterworx (from Linux Networx), which isn't open source, but it
goes a step further than Ganglia.  It uses lm_sensors and a power control
box (again from Linux Networx) to actually shut down a node if it is
getting too hot, and the event parameters are all tweakable.  It's always
a good idea to have some kind of cluster monitoring software installed,
but it's nice to be able to set up event triggers in your software in
case something goes wrong and you're not around.
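
For anyone who wants to see the general idea of a threshold-based event
trigger, here is a rough Python sketch. It is not how Clusterworx or
Ganglia does it. The reading format ("temp1: +82.5 C", "fan1: 0 RPM"),
the limits, and the function names parse_readings and check_node are all
assumptions made up for illustration; real lm_sensors output and sane
limits depend on your chips and configuration.

```python
import re

# Hypothetical limits; real values depend on the hardware.
TEMP_LIMIT_C = 70.0
FAN_MIN_RPM = 2000.0

# Assumed line format: "temp1: +45.0 C" or "fan1: 3000 RPM".
# An optional non-word character (such as a degree sign) may precede "C".
LINE_RE = re.compile(r"^(\w+):\s*\+?(-?\d+(?:\.\d+)?)\s*(?:\W?C|RPM)\b")


def parse_readings(text):
    """Return {label: value} for every line that matches LINE_RE."""
    readings = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            readings[m.group(1)] = float(m.group(2))
    return readings


def check_node(readings, temp_limit=TEMP_LIMIT_C, fan_min=FAN_MIN_RPM):
    """Return (label, reason) events for readings outside the limits."""
    events = []
    for label, value in sorted(readings.items()):
        if label.startswith("temp") and value > temp_limit:
            events.append((label, "too hot"))
        elif label.startswith("fan") and value < fan_min:
            events.append((label, "fan too slow"))
    return events
```

A cron job or small daemon could feed each node's sensor text through
parse_readings, pass the result to check_node, and then mail the admin or
tell the power controller to shut the node down whenever the returned
list is not empty. How that last step is done depends entirely on your
power control hardware.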

----
David Henry
Synergy Software, Inc.
lightdee at netscape.net







More information about the Beowulf mailing list