[Beowulf] Re: PowerEdge SC 1435: Unexplained Crashes.

Rahul Nabar rpnabar at gmail.com
Fri Oct 10 08:05:34 PDT 2008


>(1) Tell your Dell salesman that you have asked for help on this problem on
>a public mailing list for High Performance Computing. Tell him/her that you
>need high level Dell support on this. There are Dell customers on this list.

Thanks, John. I will do that. A question: judging from my symptoms, how
likely is it that this is a software issue rather than a hardware one?
They keep harping on the fact that I am running a non-validated OS. We
used to run Fedora and now run CentOS, with the same issues. They only
support Red Hat. I have a hard time being 100% certain, but the more I
see it, the more I am convinced it is the hardware.

>(2) Suspect the RAM. Ask some serious questions of your Dell support about
>RAM compatibility - HPC applications stress the RAM. Ask, and ask again, if
>the specific RAM chips you have are certified for that motherboard. Use
>dmidecode to read out the manufacturer codes of the RAM modules - do you
>have a mix of manufacturers?

Very good idea. Never tried that. I will check. I assumed that they
were all similar systems and I had compatible RAM since I bought it
all packaged together.
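
A minimal sketch of the dmidecode check suggested above, assuming the usual `dmidecode -t memory` layout in which each DIMM appears as a "Memory Device" record with `Size:` and `Manufacturer:` fields and records are separated by blank lines. The function names are hypothetical, and the exact manufacturer strings depend on what the BIOS reports.

```python
import subprocess
from collections import Counter


def parse_memory_devices(dmidecode_text):
    """Split `dmidecode -t memory` output into one dict per 'Memory Device' record."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new DIMM record.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current DMI record.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def manufacturer_counts(devices):
    """Count populated DIMMs per manufacturer string, skipping empty slots."""
    populated = [
        d for d in devices
        if d.get("Size", "").lower() != "no module installed"
    ]
    return Counter(d.get("Manufacturer", "unknown") for d in populated)


def read_local_memory_table():
    """Run dmidecode on the local node (needs root) and return its text output."""
    result = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run as root on each node, `manufacturer_counts(parse_memory_devices(read_local_memory_table()))` gives a per-node tally; more than one key in the result means that node has a mix of manufacturers.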

>Ask and ask again about BIOS updates being available for these machines.
>We had a case once of HP machines - even though the BIOSes were versioned
>the same on 200 machines, there were some differences - turns out you had to
>go as far as checking the build date.
>Get the very latest BIOS version you can.

I have the latest, but that is based only on the version number. I will
dig deeper. Could this be a bad BIOS, though, judging from the symptoms?
Some code somewhere switches the state of that LED from blue to orange;
if only I knew what the trigger was supposed to be. Someone had to
write that!
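
One way to dig deeper than the version number is to compare both the BIOS version and its release date across nodes. This is a rough sketch: it assumes the two strings have already been collected per node, for example with `dmidecode -s bios-version` and `dmidecode -s bios-release-date`. The function names are hypothetical, and the release date reported by dmidecode may not be the same thing as the vendor's build date.

```python
from collections import defaultdict


def group_by_bios(bios_info):
    """Group node names by their (version, release date) pair."""
    groups = defaultdict(list)
    for node, (version, release_date) in sorted(bios_info.items()):
        groups[(version, release_date)].append(node)
    return dict(groups)


def odd_nodes(bios_info):
    """Return nodes whose BIOS identity differs from the most common one."""
    groups = group_by_bios(bios_info)
    if not groups:
        return []
    # Treat the largest group as the reference BIOS for the cluster.
    majority = max(groups, key=lambda key: len(groups[key]))
    return sorted(
        node
        for key, nodes in groups.items()
        if key != majority
        for node in nodes
    )
```

If `odd_nodes` returns the same machines that keep crashing, that would be worth pointing out to support.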
>
>(3) The RAM will be the problem - but if you can keep notes and there are
>specific machines which crash more than others point this out to Dell and
>maybe suspect the PSUs being weak on those machines.

Yes. The crashes seem to be very clustered. We have had 5 specific
machines out of 23 crash repeatedly. We swapped the motherboards and
CPUs on those, and they have not crashed again so far. But the time
scale is only about 2 weeks, so I am not very confident of the
statistical significance of my conclusions.
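
A back-of-the-envelope way to judge how much two quiet weeks mean, assuming crashes on the swapped machines would follow a Poisson process at some pre-swap rate. The rate is an assumption to be filled in from crash logs; it is not something measured here, and the function name is hypothetical.

```python
import math


def prob_no_crashes(rate_per_machine_week, machines, weeks):
    """P(zero crashes) if crashes follow a Poisson process at the given rate."""
    expected_crashes = rate_per_machine_week * machines * weeks
    return math.exp(-expected_crashes)
```

For example, with an assumed rate of 0.1 crashes per machine-week, 5 machines over 2 weeks would be expected to produce 1 crash, and `prob_no_crashes(0.1, 5, 2)` is about 0.37. Under that assumption, two weeks with no crashes would happen by chance more than a third of the time even if the swap changed nothing, so a longer observation window or a higher old crash rate is needed before the result is convincing.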

-- 
Rahul


