[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Fri Oct 23 11:01:05 PDT 2009

On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
>
> absolutely.  this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causation ;)
> and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc,
> console logging for panics) is what lets you rule out bad juju...

Other factors that sometimes make me violate this principle of "always
establish a crash cause":

1. Manpower to debug. Let's say the error has a cause but is
relatively infrequent. I might achieve higher uptime with a simple
reboot until I get the time to fight this particular fire. People are
happier to see a crashed node humming away again as soon as possible
than to wait for me to find the time to look at it and reach a
definite diagnosis. Forensics takes time.

2. Some errors are hardware precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot compromise for one-off
random errors.
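
A minimal sketch of that compromise, under assumptions of my own: a
node is rebooted automatically only while its crashes stay infrequent,
and is held down for diagnosis once they pile up. The class name
RebootPolicy, the node name, and the threshold and window values are
all hypothetical, and nothing here talks to a real watchdog or to IPMI.

```python
from collections import deque


class RebootPolicy:
    """Allow automatic reboots only while crashes stay infrequent."""

    def __init__(self, max_crashes=2, window_s=7 * 24 * 3600):
        self.max_crashes = max_crashes  # hypothetical threshold
        self.window_s = window_s        # hypothetical look-back window, seconds
        self.crashes = {}               # node name -> deque of crash timestamps

    def on_crash(self, node, now):
        """Record a crash at time `now` and return 'reboot' or 'hold'."""
        times = self.crashes.setdefault(node, deque())
        times.append(now)
        # Forget crashes that have fallen out of the look-back window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        # Infrequent crashes: just reboot. Too many: leave the node down.
        return "reboot" if len(times) <= self.max_crashes else "hold"


policy = RebootPolicy(max_crashes=2, window_s=3600)
decisions = [policy.on_crash("node17", t) for t in (0, 600, 1200, 9000)]
print(decisions)
```

Here the third crash within the hour gets "hold", so the node stays
down until someone looks at it; a later isolated crash goes back to a
plain reboot.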

Maybe all the "nice" clusters out there never have this issue but for
me it is fairly common. Just confessing.

-- 
Rahul