[Beowulf] Node Drop-Off
hahn at physics.mcmaster.ca
Sun Nov 12 21:15:19 PST 2006
> I have a compute node that has started dropping off. When I say drop off, I
> mean the node (while running a job) will lose all connectivity and the
> machine does not respond. I have viewed the logs and can find no reason for
> the node to cease functioning.
if you connect a console to such a node, has it simply panicked?
> Has anyone ever seen such behavior?
I have the occasional node that turns itself off under load.
IPMI reports the power as off, so it's distinct from a panic.
the IPMI System Event Log (SEL) doesn't show any reason.
we (and the vendor) regard this as grounds for repair (usually
the power supply).
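
to make the panic-vs-power-off distinction concrete, here is a minimal sketch in Python. it assumes ipmitool is installed and that the node's BMC is reachable over the network; the helper names (parse_power_status, classify, query_power) and the BMC host and credentials passed in are hypothetical, and the "lanplus" interface assumes an IPMI 2.0 BMC (older ones may need "-I lan"). the remedies named in the messages are only suggestions, not a diagnosis.

```python
import subprocess


def parse_power_status(output: str) -> str:
    """Map `ipmitool chassis power status` output to 'on', 'off', or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def classify(power: str, responds: bool) -> str:
    """Combine BMC power state with whether the node answers on the network."""
    if responds:
        return "healthy"
    if power == "off":
        # Power is off: not a kernel panic. Suspect hardware (e.g. power supply).
        return "powered off: suspect hardware"
    if power == "on":
        # Power is on but the node is silent: check the console for a panic.
        return "hung or panicked: check console"
    return "unknown: could not read power state"


def query_power(bmc_host: str, user: str, password: str) -> str:
    """Ask the BMC for chassis power state (hypothetical host and credentials)."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", "status"],
        capture_output=True, text=True, timeout=30,
    )
    return parse_power_status(result.stdout)
```

for an unresponsive node, classify(query_power(...), False) separates the "turned itself off" case from a hang or panic. running "ipmitool sel list" against the same BMC shows the System Event Log, though as noted it may contain nothing useful.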
regards, mark hahn.