How do you keep clusters running....

Jeff Layton laytonjb at bellsouth.net
Thu Apr 11 13:33:58 PDT 2002


lightdee at netscape.net wrote:

> Doug J Nordwall wrote:
>
> >On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
> >
> >   What are folks doing about keeping hardware running on large clusters?
> >
> >    Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> >
> >    Sure seems like every week or two, I notice dead fans (each RS-1200
> >    has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> >
> >
> >You running lm_sensors on your nodes? That's a handy tool for paying
> >attention to things like that. We use ours in combination with ganglia
> >and pump it to a web page and to big brother to see when a cpu might be
> >getting hot, or a fan might be too slow. We actually saved a dozen
> >machines that way...we have 32 4 processor racksaver boxes in a rack,
> >and the rack was not designed to handle racksaver's fan system. That is
> >to say, there was a solid sidewall on the rack, and it kept in heat. I
> >set up lm_sensors on all the nodes (homogeneous, so configured on one and
> >pushed it out to all), then pumped the data into ganglia
> >(ganglia.sourceforge.net) and then to a web page. I noticed that the
> >temp on a dozen of the machines was extremely high. So, I took off the
> >side panel of the rack. The temp dropped by 15 C on all the nodes, and
> >everything was within normal parameters again.
> >
> >
> >    My last fan failure was a CPU fan that toasted the CPU and motherboard.
> >
> >
> >Ya, we would have seen this on ours earlier...excellent tool
>
> [snip]
>
> We use Clusterworx, which isn't open source (from Linux Networx), but it
> goes a step further than Ganglia.  It uses lm_sensors and a power control
> box (again from Linux Networx) to actually shut down a node if it is getting
> too hot, and the event parameters are all tweakable.  It's always a good
> idea to have some kind of cluster monitoring software installed, but it's
> nice to be able to set up event triggers in your software in case something
> goes wrong and you're not around.

You can set a shutdown temperature in the BIOS on most
decent motherboards. You can also easily script this up if
you have a power control unit connected to the node
that you can talk to (e.g. APC's units). Everything you need
is available as open source. You can hook all of this together
with Ganglia if you want. In fact, Matt has announced (or at least
hinted) that the next version of Ganglia will start to have a number
of new features built in (but not nodal shutdown, if I remember
correctly).
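
To give a rough idea of what "scripting this up" could look like, here
is a minimal Python sketch. It assumes you have already collected the
text output of lm_sensors' `sensors` command from each node (for
example over ssh), and that you supply your own function that tells
your power control unit to cut power to a node. The 70 C limit, the
names `parse_temps`, `nodes_to_power_off` and `enforce`, and the
`power_off` callback are all made up for illustration; the exact
`sensors` output format also varies by chip and version, so the regex
would likely need adjusting.

```python
import re

# Hypothetical limit in degrees C; pick one that suits your hardware.
SHUTDOWN_TEMP_C = 70.0

# Matches lines such as "temp1:  +45.0 C  (limit = +60 C)" or "CPU Temp: +72.5°C".
# Fan (RPM) and voltage (V) lines do not match because they do not end in "C".
_TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<value>-?\d+(?:\.\d+)?)\s*\u00b0?C\b"
)


def parse_temps(sensors_text):
    """Return {label: degrees C} for each temperature line in the text."""
    temps = {}
    for line in sensors_text.splitlines():
        match = _TEMP_LINE.match(line.strip())
        if match:
            temps[match.group("label").strip()] = float(match.group("value"))
    return temps


def nodes_to_power_off(readings, limit_c=SHUTDOWN_TEMP_C):
    """readings maps node name -> sensors text; return the overheated nodes."""
    hot = []
    for node, text in readings.items():
        temps = parse_temps(text)
        if temps and max(temps.values()) >= limit_c:
            hot.append(node)
    return sorted(hot)


def enforce(readings, power_off, limit_c=SHUTDOWN_TEMP_C):
    """Call power_off(node) for every node at or above the limit."""
    for node in nodes_to_power_off(readings, limit_c):
        power_off(node)
```

You would run something like `enforce(readings, my_power_off)` from
cron every minute or so, where `my_power_off` wraps whatever interface
your power unit offers. Keeping the power control behind a callback
means the temperature logic can be tested without touching real
hardware.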

Jeff Layton



>
>
> ----
> David Henry
> Synergy Software, Inc.
> lightdee at netscape.net
