[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Gerry Creager gerry.creager at tamu.edu
Fri Oct 23 13:42:38 PDT 2009

Greg Lindahl wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>> 2. Some errors are hardware precipitated. Aging, out-of-warranty
>> aging, hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.

*MY* aging hardware usually just falls over dead when it's done with its
useful life.  Too many intermittent errors/failures cause me to run
enough diagnostics to repair the node (if it's cheap and easy
enough) or drop it in the latest surplus run.
Gerry Creager
AATLT, Texas A&M University     Tel: 979.862.3982
1700 Research Pkwy, Ste 160     Fax: 979.862.3983
College Station, TX             Cell 979.229.5301
    77843-3139         http://mesonet.tamu.edu
