[Beowulf] Monitoring crashing machines

Loic Tortay tortay at cc.in2p3.fr
Tue Sep 9 08:52:16 PDT 2008


Carsten Aulbert wrote:
[server console management for many servers with conserver]
> 
We use conserver to get serial console access to almost all our machines.

Below is the forwarded answer to your messages from my coworker who's in
charge of this.
The tools he created for interfacing IPMI and conserver are in the
conserver "contrib" section (this may be what you referred to as the IPMI
interface for conserver).

If you want to contact him directly, his e-mail address is similar to
mine, just replace 'tortay' with 'wernli'.


Loïc.

-------- Original Message --------

> Initially, conserver.com looked nice and we also found an IPMI
> interface for it, but that comes with two downsides: (1) it blocks
> IPMI access (I have yet to find out if a secondary user can use SoL
> when another user is using this already, but I doubt it) and (2) it
> simply does not catch messages appearing in dmesg (simple ones like
> plugging in a USB keyboard), but that may be a configuration problem
> on our side.

We are using conserver (conserver.com) on 6 Linux boxes (quite old horses)
to manage more than 1500 servers.  Most of the latter are handled through
ipmitool SOL. On some servers, rare as they are, I believe IPMI access
is indeed restricted to one open connection. Even if you happen to be
unlucky in this respect (which I seriously doubt), it won't be an issue
for console access, as conserver is designed to let several users share
a console (while logging all its output, which is what we're doing).

As for the dmesg issue, you're just missing the "console=ttySx,baudrate"
kernel parameter. The kernel uses the last "console=" entry as
/dev/console, so put it after "console=tty0" if you want init to talk to
the serial line, or before it if you want init to talk to the monitor.
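
To make the ordering rule concrete, here is a minimal Python sketch that
reports which "console=" entry comes last on a kernel command line. The
function names (console_devices, primary_console) and the sample command
line are made up for this illustration; on a running Linux machine you
could feed it the contents of /proc/cmdline instead.

```python
def console_devices(cmdline):
    """Return the devices named by console= entries, in command-line order."""
    devices = []
    for token in cmdline.split():
        if token.startswith("console="):
            value = token[len("console="):]
            # Drop options such as ",115200n8" and keep only the device name.
            devices.append(value.split(",", 1)[0])
    return devices


def primary_console(cmdline):
    """Return the last console= device (the one init talks to), or None."""
    devices = console_devices(cmdline)
    return devices[-1] if devices else None


# Hypothetical command line, used only as an example.
example = "ro root=/dev/sda1 console=tty0 console=ttyS0,115200"
print(primary_console(example))
```

With the sample line above, the serial port comes last, so boot and init
messages would reach the serial console that conserver is logging.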

> Also we tried (r)syslog but somehow this does not get all the messages
> either, even when using something like *.* @loghost.

This is true, however, and is one of the reasons we went to the trouble
of keeping consoles (IPMI or other) open for all our servers at all times.
It can be very valuable to grep through all the console logfiles to
catch the error message that did not show up anywhere else.
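
As an illustration only, searching a directory of per-host console logs
could be sketched in Python as below. The "*.log" naming and the directory
passed in are assumptions made for the example, not a description of our
actual layout; a plain grep -r does the same job.

```python
import re
from pathlib import Path


def grep_console_logs(log_dir, pattern):
    """Yield (file name, line number, line) for each line matching pattern.

    Assumes one logfile per console, named "<host>.log", inside log_dir.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(log_dir).glob("*.log")):
        # Console output can contain arbitrary bytes, so never fail on decoding.
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, lineno, line.rstrip("\n")
```

For example, list(grep_console_logs("/path/to/console/logs", r"Oops|panic"))
would collect every matching line together with the host logfile it came
from.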

> For the time being we are experimenting with using "script" in many
> "screen" environment which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.

conserver scales extremely well and will be your best friend (if you
don't have a dog, that is).

> So, my question boils down to: How do people solve this problem?

Feel free to e-mail me privately if you need the details.


-- 
|       Loïc Tortay <tortay at cc.in2p3.fr> - IN2P3 Computing Centre      |


