How do you keep clusters running....
lightdee at netscape.net
Thu Apr 11 11:34:33 PDT 2002
- Previous message: scyld slave node problems
- Next message: How do you keep clusters running....
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Doug J Nordwall wrote:

>On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
>
> > What are folks doing about keeping hardware running on large clusters?
> >
> > Right now, I'm running 10 Racksaver RS-1200's (for a total of 20
> > nodes)...
> >
> > Sure seems like every week or two, I notice dead fans (each RS-1200
> > has 6 case fans in addition to the 2 CPU fans and 2 power supply
> > fans).
>
>You running lm_sensors on your nodes? That's a handy tool for paying
>attention to things like that. We use ours in combination with ganglia
>and pump it to a web page and to big brother to see when a cpu might be
>getting hot, or a fan might be too slow. We actually saved a dozen
>machines that way...we have 32 4 processor racksaver boxes in a rack,
>and the rack was not designed to handle racksaver's fan system. That is
>to say, there was a solid sidewall on the rack, and it kept in heat. I
>set up lm_sensors on all the nodes (homogeneous, so configured on one and
>pushed it out to all), then pumped the data into ganglia
>(ganglia.sourceforge.net) and then to a web page. I noticed that the
>temp on a dozen of the machines was extremely high. So, I took off the
>side panel of the rack. The temp dropped by 15 C on all the nodes, and
>everything was within normal parameters again.
>
> > My last fan failure was a CPU fan that toasted the CPU and motherboard.
>
>Ya, we would have seen this on ours earlier...excellent tool

[snip]

We use Clusterworx (from Linux Networx), which isn't open source, but it
goes a step further than Ganglia. It uses lm_sensors and a power control
box (again from Linux Networx) to actually shut down a node if it is
getting too hot, and the event parameters are all tweakable.

It's always a good idea to have some kind of cluster monitoring software
installed, but it's nice to be able to set up event triggers in your
software in case something goes wrong and you're not around.

----
David Henry
Synergy Software, Inc.
lightdee at netscape.net
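
For anyone without a commercial tool, the kind of event trigger described above can be roughed out in a few lines of Python. This is only a minimal sketch, not how Clusterworx or Ganglia do it. It assumes the text printed by the lm_sensors `sensors` command looks roughly like the SAMPLE string below (the real layout varies by chip and lm_sensors version), and the thresholds and the function names `parse_readings` and `find_alerts` are made up for illustration.

```python
import re

# Hypothetical thresholds; tune these for your own hardware.
MAX_TEMP_C = 60.0
MIN_FAN_RPM = 3000

# Matches lines such as "fan1: 4891 RPM ..." or "temp1: +52.0°C ...".
# The exact line layout is an assumption about `sensors` output.
READING = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)\s*(?P<unit>RPM|°C|C)\b"
)


def parse_readings(text):
    """Return {label: (value, unit)} for fan and temperature lines only."""
    readings = {}
    for line in text.splitlines():
        m = READING.match(line.strip())
        if m:
            unit = "RPM" if m.group("unit") == "RPM" else "C"
            readings[m.group("label").strip()] = (float(m.group("value")), unit)
    return readings


def find_alerts(readings, max_temp_c=MAX_TEMP_C, min_fan_rpm=MIN_FAN_RPM):
    """Return a message for each hot sensor or slow fan, sorted by label."""
    alerts = []
    for label, (value, unit) in sorted(readings.items()):
        if unit == "C" and value > max_temp_c:
            alerts.append(f"{label}: {value:.1f} C exceeds {max_temp_c:.1f} C")
        elif unit == "RPM" and value < min_fan_rpm:
            alerts.append(f"{label}: {value:.0f} RPM below {min_fan_rpm} RPM")
    return alerts


# Made-up sample in the general style of `sensors` output.
SAMPLE = """\
w83781d-isa-0290
Adapter: ISA adapter
fan1:      4891 RPM  (min = 3000 RPM, div = 2)
fan2:         0 RPM  (min = 3000 RPM, div = 2)
temp1:      +72.0°C  (limit = +60°C, hysteresis = +50°C)
temp2:      +41.5°C  (limit = +60°C, hysteresis = +50°C)
"""

if __name__ == "__main__":
    for alert in find_alerts(parse_readings(SAMPLE)):
        print(alert)
```

On a real node you would feed it the captured output of `sensors` instead of SAMPLE, and hang whatever action you like (mail, page, power-off through your own power control setup) on a non-empty alert list.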
More information about the Beowulf mailing list
