[Beowulf] PowerEdge SC 1435: Unexplained Crashes.

Rahul Nabar rpnabar at gmail.com
Fri Oct 17 08:37:17 PDT 2008


On Fri, Oct 17, 2008 at 10:22 AM, Nifty niftyompi Mitch
<niftyompi at niftyegg.com> wrote:


> Check the baseboard management controller log (Ctrl+E).
>
> Tell us what software distribution you are running and any changes that might have
> been made (no matter how small). What is the default run level (is X11 active or not active)?
> Are power saving options enabled in the BIOS?


Distro: CentOS 5.2.

Linux node03 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
x86_64 x86_64 GNU/Linux

No changes made to standard kernel.  X11 not active. Power saving not enabled.

> Also, what hardware monitor software are you running?  I have seen system admins add
> their own package to systems only to find that RHEL has an equivalent package
> that uses different device drivers for the same hardware with impossible to diagnose
> results.  Custom kernel?

I am not sure what you mean by "hardware monitor software". I do not
recall installing anything special.

> Disable cpuspeed, hardware monitor and hardware control software to see if stability changes.

There are a bunch of Dell utilities that come up at boot time: BMC,
RAID, Broadcom PXE, remote management controllers. Do you want me to
disable those?

Stability has already changed: after I swapped the motherboard+CPU,
there have been no more dead nodes in over 2 weeks now (yay!). But I
just want to make sure this won't be a recurring problem with these
SC1435s before we go in for our next expansion.

> What additional hardware is in the chassis?

None other than what came with the original Dell units. These are only
2 months old now. They have dual NICs, disks, and no CD-ROMs, and are
linked to a Dell KVM via a SIP module. No mix-n-matching of hardware;
it was a single monolithic Dell order.

> The "poweredge indicator turning orange" tells me that the problem was detected by the
> system and there should be a hint in the log.   The orange state is sticky and
> needs to be cleared....

Funny. It wasn't sticky for me. When I rebooted, the orange light
cleared. I did not need to reset it via the BIOS. Unfortunately, the SC
series does not have the tiny LCD for an error display.

-- 
Rahul


