[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

ed in 92626 ed92626 at gmail.com
Fri Oct 23 09:01:28 PDT 2009


On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar <rpnabar at gmail.com> wrote:

> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>


> Of course, one might say, a well configured HPC compute-node
> shouldn't be getting to a hung point anyways; but in-practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> Some BIOSes have a setting for this: the number of times to reboot before giving up.


> The danger, seems to me: What if a node kept crashing (due to say,  a
> bad HDD or something). Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>

You could also do something at the system level to prevent it. If the system
boots and the previous_uptime is less than one hour, shut down the system.
The WD timer will not wake it up.
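
A minimal sketch of that boot-time check, in Python. Everything named here
is an assumption for illustration, not something from this thread: the state
file path, the helper names, and the idea that a periodic job (cron, say)
records the current uptime so the next boot can read it back.

```python
import os

MIN_UPTIME_S = 3600  # the one-hour threshold suggested above


def read_previous_uptime(path):
    """Return the last recorded uptime in seconds, or None if unavailable."""
    try:
        with open(path) as f:
            return float(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        # Missing, unreadable, or empty/garbled file: treat as "no history".
        return None


def record_uptime(path, uptime_s):
    """Atomically store the current uptime (meant to be called periodically)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(f"{uptime_s:.0f}\n")
    os.replace(tmp, path)  # atomic rename, so a crash cannot leave a half-written file


def should_halt(previous_uptime_s, min_uptime_s=MIN_UPTIME_S):
    """True if the previous boot was short enough to suggest a crash loop."""
    if previous_uptime_s is None:
        return False  # no history: let the node boot normally
    return previous_uptime_s < min_uptime_s
```

An early boot script would call should_halt(read_previous_uptime(path)) and,
if it returns True, power the node off before the watchdog is armed. The
uptime to record can be read from the first field of /proc/uptime on Linux.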

>
> If one had to do the watchdogging should one do the resets locally
> using the IPMI local interface (hogs cpu cycles) or a central
> Nagios-like system that could issue such a command. Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
Also, almost all systems that can do this will send out a page and an email
when the event occurs, so someone will know about it.

Ed



>  --
> Rahul
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>