[Beowulf] Node Drop-Off

Gerald Davies gerald.davies at gmail.com
Sun Nov 12 13:40:36 PST 2006

On 11/12/06, Tim Moore <twm at tcg-hsv.com> wrote:
> Hello All -
> I have a compute node that has started dropping off.  When I say drop
> off, I mean the node (while running a job) will lose all connectivity
> and the machine does not respond.  I have viewed the logs and can find
> no reason for the node to cease functioning.  Let me state that this
> behavior did not occur until after a processor upgrade, BIOS upgrade and
> OS upgrade.  I went in to the BIOS and made a few changes that seemed to
> prolong it even though its occurrence was mostly random.  If I leave the
> node idle, it will run for days.
> Has anyone ever seen such behavior?

I've seen that with faulty hardware, but then you've changed a few things.

If you're sure it's not your code or the OS, then take a spare node
and apply the things you've changed (processor, BIOS, memory?) to it
one at a time, testing under load after each change.
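
To make the one-change-at-a-time idea concrete, here is a rough sketch of a
test plan generator. The component names and the "old"/"new" labels are
placeholders I made up for illustration, not anything taken from your
setup; substitute whatever you actually changed.

```python
# Hypothetical labels for the known-good and upgraded state of each
# component; replace them with the real part numbers or versions.
BASELINE = {"processor": "old", "bios": "old", "os": "old"}
UPGRADED = {"processor": "new", "bios": "new", "os": "new"}


def one_change_at_a_time(baseline, upgraded):
    """Yield (component, config) pairs where each config differs from
    the baseline in exactly one component."""
    for name in baseline:
        config = dict(baseline)          # start from the known-good state
        config[name] = upgraded[name]    # apply a single change
        yield name, config


if __name__ == "__main__":
    for changed, config in one_change_at_a_time(BASELINE, UPGRADED):
        print(f"change only {changed}: {config}")
```

Run a job on the spare node under each configuration. If only one of them
drops off, that change is the likely suspect; if none do, the fault may
depend on a combination of changes or on the original node's hardware.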

Gerald Davies


