[Beowulf] Node Drop-Off

Mark Hahn hahn at physics.mcmaster.ca
Sun Nov 12 21:15:19 PST 2006

> I have a compute node that has started dropping off.  When I say drop off, I 
> mean the node (while running a job) will lose all connectivity and the 
> machine does not respond.  I have viewed the logs and can find no reason for 
> the node to cease functioning.

if you connect a console to such a node, has it simply panicked?

> Has anyone ever seen such behavior?

I have the occasional node which turns itself off under load.
IPMI reports the power as off, so this is distinct from a panic.
the IPMI system event log (SEL) doesn't show any reason.
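
as a rough illustration of telling the two cases apart, here is a minimal Python sketch that asks a node's BMC for its power state via ipmitool and combines that with whether the node still answers on the network. it is only a sketch under assumptions: the function names, the "lanplus" interface and the use of the IPMI_PASSWORD environment variable (ipmitool's -E option) are illustrative choices, not something from the cluster described here.

```python
import subprocess


def parse_power_status(output: str) -> str:
    """Map `ipmitool chassis power status` output to 'on', 'off' or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def classify_node(power: str, pingable: bool) -> str:
    """Combine BMC power state with network reachability (hypothetical helper)."""
    if pingable:
        return "responding"
    if power == "off":
        return "powered off: suspect hardware, e.g. the power supply"
    if power == "on":
        return "powered on but unresponsive: possible panic or hang, check the console"
    return "unknown: BMC unreachable or unexpected output"


def query_power(bmc_host: str, user: str, timeout: int = 15) -> str:
    """Ask the BMC for chassis power state.

    Assumes ipmitool is installed, the BMC speaks IPMI v2.0 ("lanplus"),
    and the password is in the IPMI_PASSWORD environment variable (-E).
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
           "chassis", "power", "status"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    if result.returncode != 0:
        return "unknown"
    return parse_power_status(result.stdout)
```

for a node that has stopped answering, something like classify_node(query_power("node17-bmc", "admin"), pingable=False) would separate "power is off" from "power is on but the OS is gone"; the host and user names there are made up. running "ipmitool sel list" against the same BMC prints the event log, which is worth checking by hand even when it turns out to be empty.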

we (and the vendor) regard this as grounds for repair (usually
the power supply).

regards, mark hahn.

