[Beowulf] PowerEdge SC 1435: Unexplained Crashes.
rpnabar at gmail.com
Fri Oct 17 08:37:17 PDT 2008
On Fri, Oct 17, 2008 at 10:22 AM, Nifty niftyompi Mitch
<niftyompi at niftyegg.com> wrote:
> Check the baseboard management controller log (Ctrl+E).
> Tell us what software distribution you are running and any changes that might have
> been made (no matter how small). What is the default run level (is X11 active/ not active).
> Are power saving options enabled in the BIOS?
Distro: CentOS 5.2.
Linux node03 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
x86_64 x86_64 GNU/Linux
No changes made to standard kernel. X11 not active. Power saving not enabled.
> Also what hardware monitor software are you running. I have seen system admins add
> their own package to systems only to find that RHEL has an equivalent package
> that uses different device drivers for the same hardware with impossible to diagnose
> results. Custom kernel?
I am not sure what you mean by "hardware monitor software". I do not
recall installing anything special.
> Disable cpuspeed, hardware monitor and hardware control software to see if stability changes.
There are a bunch of Dell utilities that come up at boot-time: BMC,
RAID, Broadcom-PXE, Remote manage controllers. You want me to disable
those?
Stability has already changed: after I swapped motherboard+CPU, no
more dead nodes in over 2 weeks now (yay!). But I just want to make
sure this won't be a recurring problem with these SC1435's before we
go in for our next expansion.
> What additional hardware is in the chassis?
None other than what came with the original Dell units. These are only
2 months old now. They do have dual NICs and no CDROMs. Have disks.
Linked to a Dell KVM via a SIP module. No mix-n-matching of hardware.
Was a monolithic Dell order.
> The "poweredge indicator turning orange" tells me that the problem was detected by the
> system and there should be a hint in the log. The orange state is sticky and
> needs to be cleared....
Funny. It wasn't sticky for me. When I rebooted the orange light
cleared. I did not need to reset it via the BIOS. Unfortunately the SC
series does not have the tiny LCD for an error display.