[Beowulf] Monitoring and Metrics
Paul Edmon
pedmon at cfa.harvard.edu
Sat Oct 7 06:13:18 PDT 2017
For general monitoring of cluster usage, we use:
https://github.com/fasrc/slurm-diamond-collector
and pipe the results to Grafana. We also use XDMod:
http://open.xdmod.org/7.0/index.html
As for specific node alerting, we use the old standby of Nagios.
-Paul Edmon-
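
For anyone unfamiliar with this kind of pipeline, here is a minimal sketch of the general idea: count jobs per scheduler state and emit them as time-series lines that a dashboard can plot. It is not the code from the collector linked above. It assumes a Graphite-compatible plaintext listener sits behind Grafana (the message does not say which backend is used), and the metric prefix, the host argument, and all function names are hypothetical.

```python
import socket
import time
from collections import Counter


def count_job_states(squeue_states):
    """Count jobs per state from an iterable of state strings (e.g. 'RUNNING')."""
    return Counter(s.strip().lower() for s in squeue_states if s.strip())


def to_graphite_lines(counts, prefix="cluster.slurm.jobs", timestamp=None):
    """Render counts as Graphite plaintext lines: '<path> <value> <epoch>'.

    The prefix is a made-up naming scheme, not one taken from the thread.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{state} {value} {ts}" for state, value in sorted(counts.items())]


def send_lines(lines, host, port=2003):
    """Send lines to a Graphite-compatible plaintext listener.

    Assumption: the backend accepts the plaintext protocol on this port.
    """
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # In practice the states might come from `squeue -h -o %T`; hard-coded here
    # so the sketch runs without a scheduler.
    states = ["RUNNING", "RUNNING", "PENDING"]
    for line in to_graphite_lines(count_job_states(states), timestamp=1507381998):
        print(line)
```

Run from cron or a collector daemon, something like this yields one data point per job state per interval. Per-node alerting stays with a separate tool such as Nagios.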
On 10/7/2017 8:21 AM, Josh Catana wrote:
> This may have been brought up in the past, but I couldn't find much in
> my message archive.
> What are people using for HPC cluster monitoring and metrics lately?
> I've been low on time to add features to my home grown solution and
> looking at some OTS products.
> I'm looking for something that can do monitoring, alert on condition,
> broken hardware, etc.
> Also something that does system resource utilization metrics. If it
> has a plug-in for a scheduling system like PBS where I can correlate a
> job ID to the metrics of the systems it is currently running on or
> previously ran on at the time, that would be an amazing plus.
> Any of you beowulfers have any suggestions?