Of course, one might say, a well-configured HPC compute node
shouldn't be getting to a hung point anyway; but in practice I see a
few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.

I think that this is an interesting concept, and I don't want to
dismiss it. You could imagine jobs which checkpoint often and
automatically restart themselves from a checkpoint if a machine fails
like this.
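
To make the checkpoint-and-restart idea concrete, here is a rough sketch in
Python. It is only an illustration: the function names, the JSON checkpoint
file, and the `fail_at` knob for simulating a node failure are all my own
inventions, and a real job would save its actual solver state (or use the
checkpointing its application or batch system provides).

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write cannot
    leave a half-written checkpoint behind."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path, n_steps, every=10, fail_at=None):
    """Resume from the last checkpoint (if any) and run to n_steps.

    fail_at is a hypothetical hook that simulates a node dying.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job dies and the scheduler simply requeues it, calling `run()` again
with the same checkpoint path picks up from the last saved step, so at most
`every` steps of work are lost.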

My philosophy, though, would be to leave a machine down till the cause
of the crash is established. Now that you have IPMI and serial
consoles, you should be looking at the IPMI logs and your
/var/log/mcelog to see if there are uncorrectable ECC errors, and
enabling crash dumps and the Magic SysRq keys.
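
As a small aid for the SysRq part, something like the following could be run
across the nodes to see whether Magic SysRq is switched on at all. It is a
sketch, assuming a Linux node that exposes the setting at
/proc/sys/kernel/sysrq (0 means disabled, any non-zero value enables some or
all of the functions); the function names are my own.

```python
def sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Return the kernel's SysRq setting as an int, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def sysrq_enabled(path="/proc/sys/kernel/sysrq"):
    """True if the setting is readable and non-zero."""
    value = sysrq_setting(path)
    return value is not None and value != 0


if __name__ == "__main__":
    print("Magic SysRq enabled:", sysrq_enabled())
```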

Any cluster should be designed with a few extra nodes, which will
normally be idle but will be used when one or two nodes are off on the
Pat and Mick. However, this doesn't help when a large parallel run is
brought down by a single node failing - the advice here is to
checkpoint the jobs often.

