[Beowulf] Monitoring crashing machines
carsten.aulbert at aei.mpg.de
Tue Sep 9 00:53:48 PDT 2008
I would guess this problem is fairly common and that many solutions
are already in place, so I would like to enquire about your solutions
to it:
In our large cluster we have certain nodes going down with hard disk
I/O errors. We have some suspicion about the causes but would like to
investigate this further. However, the log files show little, if
anything at all (which is understandable, given that the log files
reside on disk and we are hitting disk I/O errors). The console does
show some interesting messages, but we cannot scroll back far enough.
My question now is: is there a cute little way to gather the console
output of > 1000 nodes? The nodes don't have physical serial cables
attached to them, nor do we want to use many concentrators to achieve
this, but the off-the-shelf Supermicro boxes all have an IPMI card
installed and Serial over LAN (SoL) works quite OK.
Initially, conserver.com looked nice and we also found an IPMI interface
for it, but that comes with two downsides: (1) it blocks IPMI access (I
have yet to find out whether a second user can use SoL while another
user is already using it, but I doubt it) and (2) it simply does not
catch messages appearing in dmesg (simple ones like plugging in a USB
keyboard), but that may be a configuration problem on our side.
We also tried (r)syslog, but somehow this does not get all the messages
either, even when forwarding everything with a rule like *.* @loghost.
For the time being we are experimenting with running "script" in many
"screen" sessions, which should be able to capture ipmitool's SoL
output, but somehow that strikes me as inefficient as well.
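To make the "script"-style idea concrete, here is a minimal sketch of the
same approach in Python: one ipmitool SoL session per node, with every
line of output timestamped and appended to a per-node file on the
collecting host. The host names, user name, password file, and log
directory are hypothetical. It also assumes that ipmitool is happy with a
pseudo-terminal on stdin (which is roughly what "script" provides) and
that one thread per node is acceptable; neither has been tested at the
scale of > 1000 nodes.

```python
import datetime
import os
import pty
import subprocess
import threading
from pathlib import Path


def sol_command(bmc_host, user, password_file):
    """Build the ipmitool command line for one Serial-over-LAN session."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,        # address of the node's IPMI card (hypothetical name)
        "-U", user,
        "-f", password_file,   # keeps the password off the command line
        "sol", "activate",
    ]


def copy_with_timestamps(stream, log, now=datetime.datetime.now):
    """Copy lines from a binary stream to a binary log, prefixing a timestamp."""
    for line in stream:
        stamp = now().isoformat(timespec="seconds").encode()
        log.write(stamp + b" " + line)
        log.flush()  # flush each line so a crash loses as little as possible


def capture_node(bmc_host, user, password_file, log_dir):
    """Run one SoL session and append its output to <log_dir>/<host>.console.log."""
    log_path = Path(log_dir) / f"{bmc_host}.console.log"
    # Assumption: ipmitool wants a terminal on stdin, so hand it a pty.
    master_fd, slave_fd = pty.openpty()
    try:
        with open(log_path, "ab") as log:
            proc = subprocess.Popen(
                sol_command(bmc_host, user, password_file),
                stdin=slave_fd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
            )
            copy_with_timestamps(proc.stdout, log)
            return proc.wait()
    finally:
        os.close(slave_fd)
        os.close(master_fd)


def capture_all(bmc_hosts, user, password_file, log_dir):
    """Start one capture thread per node and wait for all of them."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    threads = [
        threading.Thread(
            target=capture_node,
            args=(host, user, password_file, log_dir),
            daemon=True,
        )
        for host in bmc_hosts
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

It would be started with something like
capture_all(["node001-ipmi", "node002-ipmi"], "admin", "/root/.ipmipass",
"/var/log/consoles"), where all four arguments are placeholders. Like
conserver, this still occupies each node's SoL session for as long as it
runs, so it does not remove downside (1) above.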
So, my question boils down to: How do people solve this problem?
Thanks a lot
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31