[Beowulf] Remote console management
David Kewley
kewley at gps.caltech.edu
Sun Sep 25 15:33:26 PDT 2005
On Friday 23 September 2005 09:05, Jerker Nyberg wrote:
> I am currently installing some Dell 1850 (with remote access cards) and
> HP DL140/DL380 and it would be great with some input from someone on the
> integrated remote access in a Linux environment. Remote console and
> reset/on/off is good enough for me.
I am bringing up a large cluster of PE 1850s right now. Dell offers Linux
command line tools to change most BIOS & BMC settings from within the host
OS.
The non-IPMI command-line tools are part of Dell's free OpenManage product.
This has been excellent for tracking down e.g. memory errors during the
cluster burn-in period. From the master node I simply do:
    shmux -m -c "omreport system esmlog" - < /ml/all-1024 > junk
    grep Descr junk | egrep -v "(Ambient Temp|log cleared|Intrusion)" | \
        sort | uniq -c
This gives me output like this:
1 compute-11-38.local: Description : ECC Error Correction detected on Bank 3 DIMM B
1 compute-12-7.local: Description : ECC Error Correction detected on Bank 3 DIMM B
1 compute-15-24.local: Description : correctable memory error logging disabled
6 compute-15-24.local: Description : ECC Error Correction detected on Bank 1 DIMM B
2 compute-17-37.local: Description : ECC Error Correction detected on Bank 3 DIMM B
2 compute-22-26.local: Description : ECC Error Correction detected on Bank 1 DIMM B
375 compute-22-33.local: Description : ECC Error Correction detected on Bank 2 DIMM A
333 compute-22-34.local: Description : ECC Error Correction detected on Bank 3 DIMM A
4 compute-23-16.local: Description : ECC Error Correction detected on Bank 2 DIMM B
3 compute-23-22.local: Description : ECC Error Correction detected on Bank 3 DIMM A
20 compute-24-1.local: Description : ECC Error Correction detected on Bank 1 DIMM A
1 compute-25-26.local: Description : ECC Error Correction detected on Bank 2 DIMM B
103 compute-25-29.local: Description : ECC Error Correction detected on Bank 2 DIMM B
1 compute-26-1.local: Description : ECC Error Correction detected on Bank 3 DIMM B
18 compute-31-26.local: Description : ECC Error Correction detected on Bank 1 DIMM B
2 compute-32-10.local: Description : correctable memory error logging disabled
12 compute-32-10.local: Description : ECC Error Correction detected on Bank 3 DIMM B
1 compute-32-19.local: Description : BMC Riser PG voltage sensor state asserted
1 compute-32-19.local: Description : BMC Riser PG voltage sensor state deasserted
3 compute-32-22.local: Description : ECC Error Correction detected on Bank 1 DIMM B
1 compute-35-18.local: Description : correctable memory error logging disabled
13 compute-35-18.local: Description : ECC Error Correction detected on Bank 3 DIMM B
2 compute-37-15.local: Description : correctable memory error logging disabled
12 compute-37-15.local: Description : ECC Error Correction detected on Bank 1 DIMM A
10 compute-42-30.local: Description : ECC Error Correction detected on Bank 2 DIMM A
2 compute-42-33.local: Description : ECC Error Correction detected on Bank 1 DIMM B
1 compute-43-19.local: Description : correctable memory error logging disabled
11 compute-43-19.local: Description : ECC Error Correction detected on Bank 2 DIMM A
1 compute-43-5.local: Description : ECC Error Correction detected on Bank 1 DIMM B
1 compute-46-31.local: Description : ECC Error Correction detected on Bank 2 DIMM B
279 compute-47-40.local: Description : ECC Error Correction detected on Bank 1 DIMM B
1 compute-47-9.local: Description : ECC Error Correction detected on Bank 2 DIMM A
Now I know I need to replace at least 4 specific sticks of RAM. (This
doesn't mean "Dell RAM" is bad -- we have 6144 sticks in our compute
nodes, and I believe we're getting around 1-2% initial failure.)
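Picking the worst offenders out of a listing like the one above can be scripted. The following is only a rough sketch: the function name flag_bad_dimms and the default cutoff of 100 events are made up for illustration (they are not Dell recommendations), and it assumes the "uniq -c" line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: list DIMMs whose correctable-ECC event count
# meets or exceeds a threshold (default 100, an arbitrary cutoff).
# Reads lines in the "uniq -c" format shown above on stdin.
flag_bad_dimms() {
    threshold="${1:-100}"
    awk -v t="$threshold" '
        /ECC Error Correction detected/ && $1 + 0 >= t {
            node = $2
            sub(/:$/, "", node)          # strip the trailing colon
            # The last four fields are: Bank <n> DIMM <letter>
            printf "%s Bank %s DIMM %s (%d events)\n", node, $(NF-2), $NF, $1
        }'
}
```

It would be used at the end of the same pipeline, for example:

    grep Descr junk | sort | uniq -c | flag_bad_dimms 100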
You can report many things with omreport, configure things with omconfig,
and run many diagnostics with omdiag. All these tools are launched from
within the target's host Linux OS.
Note: I grep out "Ambient Temp" because our room has a tendency to be colder
than Dell's default warning threshold. :) I'll be changing that threshold
using omconfig very soon.
As for IPMI, Dell offers ipmish, with which you can, for example, force a
power-off on a machine remotely (and outside the machine's OS) with this
command from your management station:
ipmish -ip 192.168.0.100 -u root -p <password> power off -force
This works great -- I can troubleshoot node boot-ups and installs from the
comfort of home.
Dell also offers an IPMI Serial Over LAN tool, but I find it clunky. I look
forward to trying the open-source ipmitool package for SOL and other
functions.
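For comparison, a rough sketch of what the ipmitool equivalents might look like. The wrapper name bmc, the BMC_USER, IPMI_IF and DRY_RUN variables, and the address are placeholders for illustration. Whether a given BMC accepts these commands depends on its IPMI version and configuration; in particular, "sol activate" needs the lanplus (IPMI 2.0) interface, so treat this as untested on any particular hardware.

```shell
#!/bin/sh
# Sketch only: a thin wrapper around ipmitool for talking to a node's BMC.
# BMC_USER, IPMI_IF and DRY_RUN are hypothetical knobs for this example.
# The password is taken from the IPMI_PASSWORD environment variable (-E).
# With DRY_RUN=1 the command line is printed instead of executed.
bmc() {
    host="$1"; shift
    set -- ipmitool -I "${IPMI_IF:-lan}" -H "$host" -U "${BMC_USER:-root}" -E "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example invocations (not run here):
#   bmc 192.168.0.100 chassis power status
#   bmc 192.168.0.100 chassis power off
#   IPMI_IF=lanplus bmc 192.168.0.100 sol activate
```

Running with DRY_RUN=1 first is a cheap way to check the command line before pointing it at a real node.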
David