[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Thu Oct 22 17:56:16 PDT 2009

I wanted to get some opinions on whether watchdog timers are a good
idea. I came across watchdogs again when reading through my IPMI
manual. In principle it sounds neat: if the system hangs, have it
reboot automatically after, say, 5 minutes. But in practice, maybe
it is a terrible idea.

Of course, one might say a well-configured HPC compute node
shouldn't be getting to a hung point anyway; but in practice I see a
few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.

The danger, it seems to me, is this: what if a node kept crashing
(due to, say, a bad HDD)? Then a watchdog would merely keep rebooting
this node a hundred times. Not such a good thing.

Have you guys used watchdog timers? Maybe there is a way to build a
circuit-breaker around the principle, so that if a node reboots
automatically more than 3 times the watchdog gives up?
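
To make the circuit-breaker idea concrete, here is a rough sketch of
what I have in mind, run once at boot. Everything in it is my own
assumption: the state-file approach, the names, and the one-hour
window. It also counts every boot, not only watchdog-triggered ones,
since I don't know of a portable way to tell them apart.

```python
import json
import time
from pathlib import Path

MAX_REBOOTS = 3        # threshold from the question above
WINDOW_SECONDS = 3600  # assumed: only boots within the last hour count


def record_boot_and_check(state_file, now=None,
                          max_reboots=MAX_REBOOTS, window=WINDOW_SECONDS):
    """Record this boot and return True if the watchdog should be armed.

    Returns False once more than `max_reboots` boots fall inside the
    window, i.e. the breaker has tripped and a human should look.
    """
    now = time.time() if now is None else now
    path = Path(state_file)
    try:
        boots = json.loads(path.read_text())
    except (FileNotFoundError, ValueError):
        boots = []  # first boot, or an unreadable state file
    # Forget boots that are older than the window.
    boots = [t for t in boots if now - t < window]
    boots.append(now)
    path.write_text(json.dumps(boots))
    return len(boots) <= max_reboots
```

A boot script would call this and start the watchdog daemon only if
it returns True. Note that the state file lives on local disk, so a
bad HDD could defeat it; keeping the counter somewhere central might
be safer.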

If one had to do the watchdogging, should one do the resets locally
using the IPMI local interface (which hogs CPU cycles), or from a
central Nagios-like system that could issue such a command? Many
scenarios seem possible. The prospect of an automated system doing a
reboot at 3am seems more tempting than having to do it manually.
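
For the central variant, I imagine something like the sketch below:
the monitoring host tracks when each node last reported in and picks
out the ones that have been silent for longer than the timeout. The
function names, the heartbeat dictionary, and the BMC host and user
values are all hypothetical. The command is only built here, not run;
it relies on ipmitool's -E option to take the password from the
IPMI_PASSWORD environment variable.

```python
import time

HANG_TIMEOUT = 300  # seconds; the 5-minute figure mentioned above


def nodes_to_reset(last_seen, now=None, timeout=HANG_TIMEOUT):
    """Return sorted names of nodes silent for longer than `timeout`.

    `last_seen` maps node name -> Unix time of its last heartbeat.
    """
    now = time.time() if now is None else now
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > timeout)


def reset_command(bmc_host, user):
    """Build (but do not run) an ipmitool power-reset command."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-E", "chassis", "power", "reset"]
```

The resulting list could be handed to subprocess.run() by the
monitoring system, ideally after a reboot-count check like the one
discussed earlier so that a flapping node is not reset forever.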

-- 
Rahul