[Beowulf] Re: PowerEdge SC 1435: Unexplained Crashes.

Rahul Nabar rpnabar at gmail.com
Fri Oct 10 08:05:34 PDT 2008

>(1) Tell your Dell salesman that you have asked for help on this problem on
>a public mailing list for High Performance Computing. Tell him/her that you
>need high level Dell support on this. There are Dell customers on this list.

Thanks John. I will do that. A question: judging from my symptoms, how
likely is it that this is a software issue rather than hardware? They
keep harping on the fact that I am running a non-validated OS. We used
to run Fedora and now run CentOS, with the same issues on both. They
only support RedHat. It is hard to be 100% certain, but the more I see
of it the more I am convinced it is the hardware.

>(2) Suspect the RAM. Ask some serious questions of your Dell support about
>RAM compatibility - HPC applications stress the RAM. Ask, and ask again, if
>the specific RAM chips you have are certified for that motherboard. Use
>dmidecode to read out the manufacturer codes of the RAM modules - do you
>have a mix of manufacturers?

Very good idea. Never tried that. I will check. I assumed that they
were all similar systems and I had compatible RAM since I bought it
all packaged together.
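
A rough sketch of how the dmidecode check suggested above could be scripted, so that the same comparison can be repeated on every node. It assumes the usual `dmidecode -t 17` (Memory Device) text layout with `Manufacturer`, `Part Number`, `Locator` and `Size` fields; the function names are made up for illustration, and dmidecode itself has to be run as root.

```python
import subprocess
from collections import Counter


def read_memory_devices():
    """Run dmidecode for DMI type 17 (Memory Device). Needs root."""
    result = subprocess.run(
        ["dmidecode", "-t", "17"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_memory_devices(text):
    """Turn dmidecode output into one dict per populated DIMM slot."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new record.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current record.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    # Empty slots report "No Module Installed" as their size; skip them.
    return [d for d in devices
            if d.get("Size", "").lower() != "no module installed"]


def manufacturer_counts(devices):
    """Count populated DIMMs per manufacturer string."""
    return Counter(d.get("Manufacturer", "unknown") for d in devices)
```

On a node, `manufacturer_counts(parse_memory_devices(read_memory_devices()))` returning more than one key would mean a mix of manufacturers in that machine. Printing `Locator` and `Part Number` for each entry gives something concrete to quote to support.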

>Ask and ask again about BIOS updates being available for these machines.
>We had a case once of HP machines - even though the BIOSes were versioned
>the same on 200 machines, there were some differences - turns out you had to
>go as far as checking the build date.
>Get the very latest BIOS version you can.

I have the latest, but that is based only on the version #. I will dig
deeper. Could this be a bad BIOS, though, judging from the symptoms?
Some code somewhere switches that LED from blue to orange; if only I
knew what the trigger was supposed to be. Someone had to write that!

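
For digging past the version number, here is a minimal sketch that groups nodes by BIOS version and release date together. It relies on the `bios-version` and `bios-release-date` keywords of `dmidecode -s`. Whether the release date is the same thing as the build date mentioned in the quoted text is an assumption, as are root ssh access to the nodes and the function names.

```python
import subprocess
from collections import defaultdict


def query_bios(host):
    """Return (version, release_date) for one node.

    Assumes passwordless ssh as root, since dmidecode needs root.
    """
    fields = []
    for keyword in ("bios-version", "bios-release-date"):
        result = subprocess.run(
            ["ssh", host, "dmidecode", "-s", keyword],
            capture_output=True, text=True, check=True,
        )
        fields.append(result.stdout.strip())
    return tuple(fields)


def group_by_bios(bios_by_host):
    """Map each distinct (version, release_date) pair to its hosts."""
    groups = defaultdict(list)
    for host, bios in bios_by_host.items():
        groups[bios].append(host)
    return {bios: sorted(hosts) for bios, hosts in groups.items()}
```

With a list of node names in `hosts`, `group_by_bios({h: query_bios(h) for h in hosts})` should come back with a single key if the BIOSes really are identical. More than one key for the same version string would be the situation described in the quoted text.
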
>(3) The RAM will be the problem - but if you can keep notes and there are
>specific machines which crash more than others point this out to Dell and
>maybe suspect the PSUs being weak on those machines.

Yes. The crashes seem to be very clustered. We have had 5 specific
machines out of 23 crash repeatedly. We swapped the motherboard+cpus
on those and they have not crashed again so far. But the time scale
is only about 2 weeks, so I am not very confident of the statistical
significance of my conclusions.
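
One way to put a rough number on that doubt, as a sketch only: treat crashes on the swapped machines as a Poisson process with some constant rate, and ask how likely a crash-free stretch would be if the swap had changed nothing. The Poisson model and the function names are assumptions, and the old crash rate has to come from the crash notes.

```python
import math


def prob_no_crashes(rate_per_machine_week, machines, weeks):
    """P(zero crashes) if crashes arrive as a Poisson process.

    rate_per_machine_week is the crash rate before the swap.
    """
    return math.exp(-rate_per_machine_week * machines * weeks)


def smallest_rate_ruled_out(machines, weeks, alpha=0.05):
    """Crash rate above which a crash-free period has probability < alpha."""
    return -math.log(alpha) / (machines * weeks)
```

For 5 machines over 2 weeks, `smallest_rate_ruled_out(5, 2)` is about 0.3 crashes per machine per week. So under this model, two quiet weeks are only strong evidence if each of those machines used to crash more often than roughly once every three weeks.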


