[Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

Mark Hahn hahn at mcmaster.ca
Sat Oct 3 14:01:42 PDT 2009

> monolithic all-or-none creature. From what you write (and my online
> reading) it seems there are several discrete parts:
> IPMI 2.0
> switched remotely accessible PDUs
> "serial concentrator type system "

I think Joe was going a bit belt-and-suspenders-and-suspenders here.

ipmi normally provides out-of-band access to the system's I2C bus
(which lets one power on/off, reset, and read the sensors.)  it also
normally provides some form of console access: usually this is by 
serial redirection (serial output can be redirected through the BMC
and onto the net).  independent of this (but usually also provided)
is a bios feature which copies the text-mode video buffer onto serial,
thus giving access to bios output.  also technically independent, but
usually provided, is "keyboard" input along the reverse path:
lan->bmc->serial->bios.
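
as a rough illustration of those functions (power, reset, sensors, console),
here is a minimal sketch that only builds ipmitool command lines and prints
them; nothing is executed.  the BMC hostname and user are made up, and I'm
assuming ipmitool with the lanplus interface, which may not match your site.

```python
import shlex


def ipmitool_cmd(bmc_host, user, *args):
    """Build an ipmitool command line for one BMC (nothing is executed here)."""
    # -I lanplus selects the IPMI 2.0 (RMCP+) interface, which SOL needs.
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment
    # variable, so it does not appear on the command line.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *args]


# hypothetical BMC hostname and user, for illustration only
bmc, user = "node042-bmc", "admin"

commands = {
    "power status": ipmitool_cmd(bmc, user, "chassis", "power", "status"),
    "power cycle": ipmitool_cmd(bmc, user, "chassis", "power", "cycle"),
    "hard reset": ipmitool_cmd(bmc, user, "chassis", "power", "reset"),
    "read sensors": ipmitool_cmd(bmc, user, "sdr", "list"),
    "serial console": ipmitool_cmd(bmc, user, "sol", "activate"),
}

for label, cmd in commands.items():
    print(f"{label:15s} {shlex.join(cmd)}")
```

to actually run one of these you would hand the list to subprocess.run();
"sol activate" is interactive, so it is better run from a terminal.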

some people also configure systems with network-aware PDUs (power bars):
APC is a common provider of these, and they provide a backup if IPMI
doesn't work for some reason (network problems, hung BMC, etc).
I do not personally think they are worthwhile because I rarely see 
IPMI problems - admittedly perhaps due to the fairly narrow range of 
parts my organization has.  smart PDUs sometimes also provide power 
monitoring, which might be useful, though I would actually prefer to 
see IPMI merely provide current sensors via I2C (in addition to volts).
(having both socket power and motherboard power might be amusing, though,
since you could calculate your PSU's efficiency - potentially even its 
load-efficiency curve.  most vendors now quote 92-93% efficiency, but 
it's unclear what load range that's for...)
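
the efficiency calculation is just DC power delivered to the board divided by
AC power drawn at the socket.  a minimal sketch, assuming you can get socket
watts from the PDU and board watts from summing volts*amps over the rails;
the readings and the PSU rating below are invented for illustration, not
measurements.

```python
def psu_efficiency(dc_watts, ac_watts):
    """Fraction of socket (AC) power that reaches the motherboard (DC)."""
    if ac_watts <= 0:
        raise ValueError("socket power must be positive")
    return dc_watts / ac_watts


# invented sample readings: (AC watts at the socket, DC watts at the board)
samples = [(120.0, 100.0), (250.0, 225.0), (400.0, 368.0)]
rated_watts = 500.0  # hypothetical PSU rating, used only to express load

# one point on the load-efficiency curve per reading
for ac, dc in samples:
    load = dc / rated_watts
    print(f"load {load:4.0%}  efficiency {psu_efficiency(dc, ac):.1%}")
```

collecting such pairs over a range of workloads would trace out the
load-efficiency curve, which is what the vendor numbers leave out.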

finally, I think Joe is advocating another layer of backups - serial 
concentrators that would connect to the console serial port on each node
to collect output if IPMI SOL isn't working.  this is perhaps a matter 
of taste, but I don't find this terribly useful.  I thought it would be 
for my first cluster, but never actually set it up.  but again, that's
because IPMI works well in my experience.

I think Joe's right in the sense that you _don't_ want a cluster without
working power control, and working post/console redirection is pretty 
valuable as well.  both become more critical with larger cluster sizes,
mainly because the chances grow of hitting a problem where you need 
power/reset/console control.  whether you need backup systems past IPMI
is unclear - depends on whether your IPMI works well.
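
to put a number on the cluster-size argument: if each node independently has
probability p of needing power/reset/console intervention in some period,
the chance that at least one of N nodes does is 1 - (1 - p)^N.  a small
sketch; the value of p is an assumption picked for illustration, and real
failures are not necessarily independent.

```python
def p_any_node_needs_help(n_nodes, p_per_node):
    """Chance that at least one of n_nodes needs intervention,
    assuming independent nodes with the same per-node probability."""
    return 1.0 - (1.0 - p_per_node) ** n_nodes


p = 0.01  # assumed per-node probability per period (illustrative only)
for n in (16, 128, 1024):
    print(f"{n:5d} nodes: {p_any_node_needs_help(n, p):.1%}")
```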

> Correct me if I am wrong but these are all "options" and varying
> vendors and implementations will offer parts or all or none of these?
> Or is it that when one says "IPMI 2" it includes all these features. I

I interpreted Joe as saying that you need IPMI2 (remote power/reset/console)
as well as backup mechanisms for IPMI failures.

> hard to translate jargon across vendors. e.g. for Dell they are called
> DRAC's etc.

vendors provide IPMI features, usually with added proprietary nonsense. 
sometimes they sacrifice parts of IPMI in favor of the proprietary crap...

> Finally, what's a "serial concentrator"? Isn't that the same as the
> SOL that Skylar was explaining to me? Or is that something different
> too?

a network-accessible box into which many serial ports plug.  some let you 
transform a serial port into a syslog stream, for instance.
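
the serial-to-syslog idea amounts to tagging each console line with the node
it came from and a syslog priority.  a simplified sketch of that step in the
style of BSD syslog (timestamp omitted, which receivers usually fill in);
the node name, the "console" tag and the facility/severity choice are my
assumptions, not what any particular concentrator does.

```python
LOCAL0 = 16  # syslog facility code for local0
INFO = 6     # syslog severity code for informational


def console_line_to_syslog(node, line, facility=LOCAL0, severity=INFO):
    """Wrap one line of serial console output as a syslog-style message."""
    pri = facility * 8 + severity  # syslog priority = facility * 8 + severity
    return f"<{pri}>{node} console: {line.rstrip()}"


# hypothetical node name and console output
print(console_line_to_syslog("node042", "Kernel panic - not syncing\r\n"))
```

the resulting string could then be sent over UDP to a log host, so console
output from every node ends up in one place.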
