Beowulf and Big Brother

Leif Nixon nixon at nsc.liu.se
Wed Nov 13 01:51:58 PST 2002


canon at pookie.nersc.gov writes:

> We are using netsaint/nagios to monitor our cluster (a little over
> 300 nodes). Netsaint works well for monitoring services and basic
> host response. [...] We just recently started using ganglia to
> monitor performance.

Sadly, there doesn't seem to be a good way of getting Nagios to
monitor data from Ganglia. You can feed the data into Nagios as
passive service checks, which sort of works, but you can't do passive
host checks.
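
For illustration, here is a rough sketch of the passive service check
route. It assumes you already have Ganglia's XML dump as a string (gmond
serves it over TCP) and that Nagios is set up to accept external
commands. The thresholds, the "Load" service name and the sample XML are
made up for the example; only the PROCESS_SERVICE_CHECK_RESULT line
format comes from Nagios itself.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical thresholds for a one-minute load average check.
WARN, CRIT = 4.0, 8.0


def passive_results(gmond_xml, metric="load_one", service="Load", now=None):
    """Turn a Ganglia XML dump into Nagios passive service check lines."""
    now = int(time.time()) if now is None else int(now)
    lines = []
    for host in ET.fromstring(gmond_xml).iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") != metric:
                continue
            value = float(m.get("VAL"))
            # Nagios return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
            code = 2 if value >= CRIT else 1 if value >= WARN else 0
            lines.append(
                "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s=%s"
                % (now, host.get("NAME"), service, code, metric, m.get("VAL"))
            )
    return lines


# Made-up sample in the shape of gmond's output (heavily trimmed).
SAMPLE = """<GANGLIA_XML VERSION="2.5.0" SOURCE="gmond">
<CLUSTER NAME="demo">
<HOST NAME="n001" IP="10.0.0.1"><METRIC NAME="load_one" VAL="0.50" TYPE="float"/></HOST>
<HOST NAME="n002" IP="10.0.0.2"><METRIC NAME="load_one" VAL="9.25" TYPE="float"/></HOST>
</CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    for line in passive_results(SAMPLE):
        print(line)
```

Each resulting line would be appended to the Nagios external command
file, whose location depends on your installation. Note that this only
covers services: a node that has dropped out of the Ganglia data
produces no line at all, which is exactly the host check gap described
above.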

So, if you have several clusters and want Nagios to notify you if a
node dies, you need to set up Nagios in a distributed configuration,
with a Nagios server on each cluster's front-end. That really is a
pain, since you have to duplicate much of the configuration between
the central Nagios server and the distributed ones. Or rather, you
need to duplicate it *and* subtly change it. I started to write
scripts to do this in an automated fashion, but after a while threw my
hands up in disgust. No fun.
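
To show what "duplicate and subtly change" means in practice, here is a
minimal sketch of generating both variants of a service definition from
one description. This is not the script I abandoned, just an
illustration. The function name and the "role" parameter are invented
for the example, the directive names are as I recall them from the
Nagios object configuration, and a real definition needs more fields
than shown here.

```python
def service_definition(host, service, command, role):
    """Render one Nagios service object fragment for the given role.

    role is "distributed" (the front-end runs the check itself) or
    "central" (the server only accepts passively submitted results).
    """
    if role not in ("central", "distributed"):
        raise ValueError("unknown role: %r" % role)
    # The subtle difference: only the distributed server checks actively.
    active = 1 if role == "distributed" else 0
    fields = [
        ("host_name", host),
        ("service_description", service),
        ("check_command", command),
        ("active_checks_enabled", active),
        ("passive_checks_enabled", 1),
    ]
    body = "\n".join("    %-24s %s" % (key, value) for key, value in fields)
    return "define service{\n%s\n    }\n" % body


if __name__ == "__main__":
    for role in ("central", "distributed"):
        print("# %s server" % role)
        print(service_definition("n001", "PING", "check_ping", role))
```

The two outputs differ in a single line here, but in a real setup the
differences pile up (the distributed servers also have to be told to
forward their results to the central one), which is where the
automation started to hurt.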

I've been thinking about hacking monitoring abilities into Ganglia,
but haven't gotten around to actually doing anything about it yet.

-- 
Leif Nixon                                    Systems expert
------------------------------------------------------------
National Supercomputer Centre           Linkoping University
------------------------------------------------------------



More information about the Beowulf mailing list