[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Gerry Creager
gerry.creager at tamu.edu
Fri Oct 23 13:42:38 PDT 2009
Greg Lindahl wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>
>> 2. Some errors are hardware precipitated. Aging, out-of-warranty
>> aging, hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>>
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
>
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.

*MY* aging hardware usually just falls over dead when it's done with its
useful life. Too many intermittent errors/failures prompt me to run
enough diagnostics to either repair the node (if it's cheap and easy
enough) or drop it in the latest surplus run.
--
Gerry Creager
AATLT, Texas A&M University Tel: 979.862.3982
1700 Research Pkwy, Ste 160 Fax: 979.862.3983
College Station, TX Cell 979.229.5301
77843-3139 http://mesonet.tamu.edu