[Beowulf] RRDtools graphs of temp from IPMI

Sat Nov 8 21:45:18 PST 2008

Gerry,

Like others, I too use ganglia - and have a custom script which reports 
cpu temps (and fan speeds) for the nodes. However, I changed the default 
method of communication for ganglia (multicast) to reduce the chatter. I 
use a unicast setup, where each node reports directly to the monitoring 
server, a dedicated machine that monitors all the systems and also 
handles other tasks (dhcp, ntp, imaging, etc.).
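
For anyone wanting to try the same thing, here is a minimal sketch of 
what such a reporting script might look like. It is not my actual 
script. It assumes sensor readings arrive as pipe-separated rows 
(name | value | units | status), which is roughly what "ipmitool sensor" 
prints, and that gmetric is on the PATH. The "ipmi_" metric prefix and 
the function names are made up for illustration.

```python
import subprocess


def parse_temps(sensor_text):
    """Return {sensor_name: degrees_C} from pipe-separated sensor rows.

    Assumed row layout: name | value | units | status ...
    Rows that are not temperatures, or have no reading, are skipped.
    """
    temps = {}
    for line in sensor_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or fields[2] != "degrees C":
            continue
        try:
            temps[fields[0]] = float(fields[1])
        except ValueError:
            continue  # e.g. "na" when a sensor has no reading
    return temps


def gmetric_argv(name, value):
    """Build the gmetric command line for one temperature reading."""
    # Hypothetical naming scheme: "CPU1 Temp" -> "ipmi_cpu1_temp"
    metric = "ipmi_" + name.lower().replace(" ", "_")
    return ["gmetric", "--name", metric, "--value", "%.1f" % value,
            "--type", "float", "--units", "Celsius"]


def report(sensor_text, run=subprocess.call):
    """Publish every temperature found in sensor_text via gmetric."""
    for name, value in sorted(parse_temps(sensor_text).items()):
        run(gmetric_argv(name, value))
```

Run from cron on each node, it could be fed with something like 
report(subprocess.check_output(["ipmitool", "sensor"]).decode()). Fan 
speeds would work the same way with a different units filter.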

Each node is using less than 1KB/sec to transmit all the ganglia 
information, including my extra metrics. The recorded data you get in 
return is well worth the rather small amount of network chatter. You 
can tune the metrics further: turn off the ones you don't want, or have 
them report less often.
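
To put a rough number on that, here is a back-of-the-envelope sketch. 
It assumes the 126 nodes from your description, treats 1KB as 1000 
bytes, takes 1KB/sec as an upper bound per node, and assumes all of it 
lands on the monitoring server's single gigabit link. The function name 
is just for illustration.

```python
def aggregate_load(nodes, bytes_per_sec_per_node, link_bits_per_sec=1e9):
    """Return (total bits/sec, fraction of the link used) for unicast
    reporting, where every node sends to one monitoring server."""
    total_bits = nodes * bytes_per_sec_per_node * 8
    return total_bits, total_bits / link_bits_per_sec


# Assumed inputs: 126 nodes, at most 1000 bytes/sec each, 1 Gbit/s link.
bits, fraction = aggregate_load(126, 1000)
print("%.2f Mbit/s, %.2f%% of the link" % (bits / 1e6, fraction * 100))
# prints "1.01 Mbit/s, 0.10% of the link"
```

Doubling the node count only doubles that figure, so under the same 
assumptions it stays a small fraction of the server's link.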

I'd suggest installing it; if you still think it is chatty, then remove 
it and look for another option. I find it useful in that you can see 
when a node died, what the load on the node was when it crashed, what 
the network traffic was, etc.

I also use cacti - but only for the head servers, switches, etc. I find 
it has too much overhead for the nodes. It is, however, useful in that 
it can send emails to alert you to problems, and allows for graphing of 
SNMP devices.

Craig.

Gerry Creager wrote:
> Now, for the flame-bait.  Bernard suggests cacti and/or ganglia to 
> handle this.  Our group have heard some mutterings that ganglia is a 
> "chatty" application and could cause some potential hits on our 1 GbE 
> interconnect fabric.
>
> A little background on our current implementation:  126 dual-quad core 
> Xeon Dell 1950's interconnected with gigabit ethernet.  No, it's not 
> the world's best MPI machine, but it should... and does... perform 
> admirably for throughput applications where most jobs can be run on a 
> node (or two) but which don't use MPI as much as, e.g., OpenMP, or in 
> some cases, even run on a single core but use all the RAM.
>
> So, we're worried a bit about having everything talk on the same 
> gigabit backplane, hence, so far, no ganglia.
>
> What are the issues I might want to worry about in this regard, 
> especially as we expand this cluster to more nodes (potentially going 
> to 2k cores, or, essentially, doubling)?