[Beowulf] S2466 nodes poweroff, stay off
David Mathog
mathog at mendel.bio.caltech.edu
Thu Dec 2 11:37:31 PST 2004
A while back I upgraded my S2466 nodes from RH 7.3 to Mandrake 10.0
with a 2.6.8-1 kernel (from kernel.org). Recently I've
discovered that poweroff on these nodes tends to be permanent.
The node does power down, and afterward pressing the
power button on the front panel lights it up and the
fans spin - but it won't boot, not even
to the BIOS. It doesn't beep error codes either; it just sits
there. Usually the reset button does nothing, but sometimes
it will allow a reboot. Sometimes hitting the reset button
2 or 3 times in rapid succession will boot it. Sometimes not.
All of these different behaviors have been observed on a single
node; it isn't that one node does it one way and another
the other way.
I hadn't noticed this poweroff to never-never land
since the upgrade because, until now, the only time a node
was powered down was to pull it, and that required
unplugging it. Unplugging a node for a while clears the
problem and it will start. The only
way to reliably boot one now following a poweroff is to:
unplug for 1 minute
replug
power on
(and for that extra je ne sais quoi which seems to raise the
success rate to 100%)
[ wait a few seconds, then hit reset ]
After that it boots normally.
One possible clue: when logging to a serial line, the end
of the poweroff sequence is:
#normal shutdown sequence messages deleted
Power down.
acpi_power_off called
ACPI-0352: *** Error: Looking up [IO2B] in namespace, AE_NOT_FOUND
search_node f7f4f220 start_node f7f4f220 return_node 00000000
ACPI-1133: *** Error: Method execution failed [\_PTS] (Node
f7f4f220), AE_NOT_FOUND
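For comparing serial captures across nodes, the ACPI errors above can be pulled out mechanically. What follows is only a minimal sketch: the function name acpi_errors and the regular expressions are illustrative assumptions, not part of any ACPI tool, and the pattern is modeled solely on the two error lines quoted above. It tolerates an error that wraps onto the next line, as the \_PTS one does here.

```python
import re

# Assumed pattern, modeled only on the two error lines quoted above:
#   ACPI-<id>: *** Error: <message> AE_<STATUS>
# DOTALL lets <message> span a wrapped line, as in the \_PTS error.
ACPI_ERROR = re.compile(
    r"ACPI-(?P<id>\d+): \*\*\* Error: (?P<message>.*?)(?P<status>AE_[A-Z_]+)",
    re.DOTALL,
)
BRACKETED = re.compile(r"\[([^\]]+)\]")


def acpi_errors(log_text):
    """Return one dict per ACPI error found in a serial-console capture."""
    errors = []
    for m in ACPI_ERROR.finditer(log_text):
        # Collapse line wraps and drop the trailing ", " before the status.
        message = " ".join(m.group("message").split()).rstrip(", ")
        errors.append({
            "id": m.group("id"),
            "names": BRACKETED.findall(message),  # e.g. IO2B, \_PTS
            "status": m.group("status"),
            "message": message,
        })
    return errors


# The tail of the poweroff sequence, copied from the capture above.
SAMPLE = "\n".join([
    r"Power down.",
    r"acpi_power_off called",
    r"ACPI-0352: *** Error: Looking up [IO2B] in namespace, AE_NOT_FOUND",
    r"search_node f7f4f220 start_node f7f4f220 return_node 00000000",
    r"ACPI-1133: *** Error: Method execution failed [\_PTS] (Node",
    r"f7f4f220), AE_NOT_FOUND",
])

for err in acpi_errors(SAMPLE):
    print(err["id"], err["status"], err["names"])
```

An error line that carries no AE_ status code would be merged with whatever follows it, so treat the output as a starting point for reading the log, not as a diagnosis.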
/etc/modprobe.preload
contains a "button" line. The button module is loading (lsmod shows
it). If it didn't, the front panel button wouldn't respond at all
after a power down.
lilo.conf has "acpi=on" for starting the kernel.
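One way to confirm that the running kernel actually received that option is to look at /proc/cmdline on each node. The sketch below is illustrative only: kernel_params and read_kernel_params are hypothetical helper names, and the parsing assumes simple space-separated key=value tokens with no quoted values.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a dict.

    Tokens of the form key=value map key -> value; bare flags map to None.
    Assumes no quoted values containing spaces.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_kernel_params(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return kernel_params(f.read())


# Usage on a node:
#   print(read_kernel_params().get("acpi"))
```

If the value printed is not "on", the setting in lilo.conf did not reach the kernel that is currently booted.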
Some possibly relevant BIOS settings (the same on all nodes):
ACPI enabled
ECC SCRUB enabled
Quickboot enabled
Diagnostic disabled
Summary disabled
In /var/log/messages it says:
Dec 2 12:15:52 monkey08 kernel: ACPI: Subsystem revision 20040326
Dec 2 12:15:52 monkey08 kernel: ACPI: Interpreter enabled
Dec 2 12:15:52 monkey08 kernel: ACPI: Using IOAPIC for interrupt routing
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI Root Bridge [PCI0] (00:00)
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI Interrupt Link [LNKA] (IRQs 3
5 10 *11)
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI Interrupt Link [LNKB] (IRQs 3
5 10 11) *0, disabled.
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI Interrupt Link [LNKC] (IRQs 3
5 10 11) *0, disabled.
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI Interrupt Link [LNKD] (IRQs 3
5 *10 11)
Dec 2 12:15:52 monkey08 kernel: PCI: Using ACPI for IRQ routing
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI interrupt 0000:00:08.0[A] ->
GSI 20 (level, low) -> IRQ 20
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI interrupt 0000:02:08.0[A] ->
GSI 19 (level, low) -> IRQ 19
Dec 2 12:15:52 monkey08 kernel: apm: overridden by ACPI.
Dec 2 12:15:52 monkey08 kernel: ACPI: (supports S0 S1 S4 S5)
Dec 2 12:15:52 monkey08 kernel: ACPI: Power Button (FF) [PWRF]
Dec 2 12:15:52 monkey08 kernel: ACPI: Sleep Button (FF) [SLPF]
Dec 2 12:15:52 monkey08 kernel: ACPI: PCI interrupt 0000:02:08.0[A] ->
GSI 19 (level, low) -> IRQ 19
I use exactly the same kernel and settings on an S2468UGN and
it is happy enough to reboot following a poweroff. These S2466
nodes used to reboot following poweroff (not with 100%
reliability, but much better than now) using RH 7.3.
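Since the two boards run the same kernel and settings, comparing their ACPI boot messages might show where they diverge. The following is a rough sketch under stated assumptions: the helper names acpi_messages and only_in_first are made up for illustration, and the code expects unwrapped syslog lines of the form "... kernel: ACPI: ..." as they appear in /var/log/messages (not wrapped as in this mail).

```python
import re

# Keep only the message part, so that timestamps and host names
# do not make otherwise identical lines look different.
KERNEL_ACPI = re.compile(r"kernel: (ACPI: .*)$")


def acpi_messages(log_text):
    """Return the ACPI kernel messages in a syslog excerpt, in order."""
    messages = []
    for line in log_text.splitlines():
        m = KERNEL_ACPI.search(line)
        if m:
            messages.append(m.group(1).strip())
    return messages


def only_in_first(first, second):
    """Return messages present in `first` but absent from `second`."""
    seen = set(second)
    return [msg for msg in first if msg not in seen]


# Usage, with log excerpts saved from each board (file names are placeholders):
#   s2466 = acpi_messages(open("s2466.messages").read())
#   s2468 = acpi_messages(open("s2468.messages").read())
#   print(only_in_first(s2466, s2468))
#   print(only_in_first(s2468, s2466))
```

Lines such as "ACPI: (supports S0 S1 S4 S5)" are worth checking first, since poweroff corresponds to the S5 soft-off state.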
Any ideas what this might be or how to fix it?
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech