[Beowulf] using Nagios to monitor compute nodes: NRPE vs check_by_ssh

Mon Dec 22 20:23:46 PST 2008

At my employer, we use a variety of monitoring tools for our various
clusters. Our Nagios box is a VM with a single processor and 512 MB of
memory. Currently, we monitor 1700 hosts, each with three or four
service checks apiece (two of which SSH to the nodes to run scripts). We
check services about every 30 minutes.
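
As a rough back-of-envelope sketch of what those numbers mean for the
central box, the snippet below computes the average check rate. It
assumes the checks are spread evenly across the 30-minute interval,
which is only an approximation of what the Nagios scheduler does; the
function and constant names are made up for illustration.

```python
# Back-of-envelope check rate for the setup described above.
# Assumption: checks are spread evenly over the interval.

HOSTS = 1700
CHECK_INTERVAL_S = 30 * 60   # "about every 30 minutes"
SSH_CHECKS_PER_HOST = 2      # the two checks that SSH to the node


def checks_per_second(hosts, checks_per_host, interval_s):
    """Average number of checks the server starts each second."""
    return hosts * checks_per_host / interval_s


if __name__ == "__main__":
    for per_host in (3, 4):
        rate = checks_per_second(HOSTS, per_host, CHECK_INTERVAL_S)
        print(f"{per_host} checks/host: {rate:.2f} checks/s")
    ssh_rate = checks_per_second(HOSTS, SSH_CHECKS_PER_HOST, CHECK_INTERVAL_S)
    print(f"SSH checks only: {ssh_rate:.2f} connections/s")
```

That is roughly three to four checks per second overall, and about two
new SSH connections per second on average.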

The load on the central box does get up there at times, but it is
generally responsive and there's not much additional network load.

We chose SSH-based checks because we were already running Ganglia for
statistics monitoring on the nodes and no one wanted to maintain yet
another daemon. It seemed like the best option for us.
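
For anyone who has not used it, here is a minimal sketch of what an
SSH-based check looks like from the Nagios side. It only builds the
command line; it does not run anything. The -H, -t and -C options are
standard check_by_ssh options, but the node name, the plugin path on
the node, and the thresholds are hypothetical and not taken from our
configuration.

```python
import shlex


def build_check_by_ssh(host, remote_command, timeout_s=30):
    """Return the argv for a check_by_ssh call that runs
    remote_command on host (nothing is executed here)."""
    return [
        "check_by_ssh",
        "-H", host,            # node to connect to
        "-t", str(timeout_s),  # give up after this many seconds
        "-C", remote_command,  # command to run on the node
    ]


if __name__ == "__main__":
    # Hypothetical plugin path and thresholds, for illustration only.
    argv = build_check_by_ssh(
        "node001",
        "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
    )
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The only things this needs on the node are sshd and the plugin script
itself, which is why there is no extra daemon to maintain.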

Best of luck with your cluster monitoring!
Alex Younts

On Mon, Dec 22, 2008 at 8:28 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> I just installed Nagios to try and monitor my 256 compute nodes
> centrally. It seems to work like a charm for all the public services
> (ping, ssh etc.) but now I was getting more ambitious and wanted to
> try to monitor the private services too (disk usage; process loads;
> torque; pbs etc.).
>
> I was just confused whether (1) to use the NRPE plugin (seems like a
> pain to deploy onto all 256 nodes) or (2) go via the check_by_ssh
> route. (I already have passwordless logins from master-nodes to
> slave-nodes)
>
> I'd like (2) because it is more secure and seems easier to deploy but
> I'm a bit afraid that this will overtax my central server.
>
> Any suggestions? Are other users using Nagios here?
>
> --
> Rahul