[Beowulf] Re: PowerEdge SC 1435: Unexplained Crashes.

Rahul Nabar rpnabar at gmail.com
Fri Oct 10 07:47:32 PDT 2008


>It's a pretty unusual hang. I bet that the reason that you don't get
>a kernel crash dump is that the kernel doesn't run long enough after
>the problem happens to create one.

Thanks Greg. I suspected that. I am actually curious: when exactly
can kdump be useful? If a crash is hardware-precipitated, the second
kernel never gets a chance to do what it's supposed to. If it is
software-related, and the first kernel actually has time to detect the
inconsistency, then it might as well "deal" with the offending process.

>Probably your fastest solution is to swap parts until it works. Tedious,
>but...

That's exactly what I'm doing so far! :-) The problem is: which ones?
CPU / MB / power supplies / RAM? I've even received suggestions as
exotic as re-flashing the BIOS, ESM firmware upgrades, and processor
reseating. Any bets on the likelihood based on symptoms, intuition,
and past experience?

We've swapped CPUs and processors on all the offending nodes. That seems
to have worked so far (i.e. none of the swapped machines have
re-crashed). But I'm hesitant to conclude "problem solved", since all
of this is only over the last 2 weeks.

I'm dreading the day when one of the swapped machines re-crashes!
Let's see........

-- 
Rahul


