[Beowulf] Monitoring and Metrics
Benson Muite
benson.muite at ut.ee
Sun Oct 8 02:24:16 PDT 2017
May also be of interest:
JobDigest – Detailed System Monitoring-Based Supercomputer Application
Behavior Analysis
Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev,
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy
http://russianscdays.org/files/pdf17/185.pdf
On 10/07/2017 04:13 PM, Paul Edmon wrote:
> So for general monitoring of the cluster usage we use:
>
> https://github.com/fasrc/slurm-diamond-collector
>
> and pipe to Grafana. We also use XDMod:
>
> http://open.xdmod.org/7.0/index.html
>
> As for specific node alerting, we use the old standby of Nagios.
>
> -Paul Edmon-
>
>
> On 10/7/2017 8:21 AM, Josh Catana wrote:
>> This may have been brought up in the past, but I couldn't find much in
>> my message archive.
>> What are people using for HPC cluster monitoring and metrics lately?
>> I've been low on time to add features to my home grown solution and
>> looking at some OTS products.
>> I'm looking for something that can do monitoring, alert on condition,
>> broken hardware, etc.
>> Also something that does system resource utilization metrics. If it
>> has a plug-in for a scheduling system like PBS where I can correlate a
>> job ID to the metrics of the systems it is currently running on or
>> previously ran on at the time, that would be an amazing plus.
>> Any of you beowulfers have any suggestions?
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
----
Research Fellow of Distributed Systems
Institute of Computer Science
University of Tartu
J. Liivi 2 50409
Tartu, Estonia
http://kodu.ut.ee/~benson