[Beowulf] Monitoring and Metrics

Sat Oct 7 06:13:18 PDT 2017

For general monitoring of cluster usage we use:

https://github.com/fasrc/slurm-diamond-collector

and pipe it to Grafana.  We also use XDMod:

http://open.xdmod.org/7.0/index.html

As for specific node alerting, we use the old standby of Nagios.
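
For anyone unfamiliar with that kind of pipeline, here is a minimal sketch of the idea. It is not the actual slurm-diamond-collector code: it just counts jobs per state from `squeue -h -o %T` output and formats the counts as Graphite plaintext lines ("path value timestamp"), which a Graphite-backed Grafana dashboard could then plot. The function names and the `cluster.slurm.jobs` metric prefix are made up for illustration.

```python
import subprocess
import time
from collections import Counter


def count_job_states(squeue_output):
    """Count jobs per state, given one state name per line of input."""
    return Counter(
        line.strip().lower()
        for line in squeue_output.splitlines()
        if line.strip()
    )


def to_graphite_lines(counts, prefix="cluster.slurm.jobs", timestamp=None):
    """Format counts as Graphite plaintext lines: '<path> <value> <timestamp>'.

    The prefix is a hypothetical metric namespace, not one used by any
    particular site.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [
        "%s.%s %d %d" % (prefix, state, n, ts)
        for state, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    # -h suppresses the header, -o %T prints the extended job state.
    out = subprocess.run(
        ["squeue", "-h", "-o", "%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in to_graphite_lines(count_job_states(out)):
        print(line)
```

In practice a collector framework handles the scheduling and the shipping to the metrics store; the sketch only shows the parse-and-format step.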

-Paul Edmon-

On 10/7/2017 8:21 AM, Josh Catana wrote:
> This may have been brought up in the past, but I couldn't find much in 
> my message archive.
> What are people using for HPC cluster monitoring and metrics lately? 
> I've been low on time to add features to my home-grown solution and 
> am looking at some OTS products.
> I'm looking for something that can do monitoring, alert on condition, 
> broken hardware, etc.
> Also something that does system resource utilization metrics. If it 
> has a plug-in for a scheduling system like PBS where I can correlate a 
> job ID to the metrics of the systems it is currently running on or 
> previously ran on at the time, that would be an amazing plus.
> Any of you beowulfers have any suggestions?
