[Beowulf] gpu+server health monitoring -- ensure system cooling

Adam DeConinck ajdecon at ajdecon.org
Sat Jun 6 07:09:39 PDT 2015


Hi Kevin,

nvidia-healthmon is the tool I've used for this kind of thing in the past.
It can do temperature checks as well as some sanity checks for things like
PCIe connectivity.

http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
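
As a rough illustration of the kind of temperature check mentioned above, the sketch below polls nvidia-smi from Python. It is not how nvidia-healthmon works internally. The 80 C threshold and the function names are assumptions made up for this example, so check the board's documentation for real limits. It also assumes every GPU reports a plain integer; a value such as "N/A" would make the parse fail.

```python
import subprocess

# Hypothetical warning threshold in degrees C (an assumption, not an NVIDIA figure).
WARN_TEMP_C = 80


def parse_temperatures(smi_output):
    """Turn nvidia-smi CSV output (one integer per line) into a list of ints."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU, in degrees C."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_temperatures(out)


def hot_gpus(temps, limit=WARN_TEMP_C):
    """Return (index, temperature) pairs for GPUs at or above the limit."""
    return [(i, t) for i, t in enumerate(temps) if t >= limit]


def main():
    """Print a line for each GPU that is at or above the warning threshold."""
    for index, temp in hot_gpus(read_gpu_temperatures()):
        print("GPU %d is at %d C" % (index, temp))
```

Calling main() from cron or a simple loop while a load is running would show whether the server fans keep up. A K80 is a dual-GPU board, so it shows up as two devices in the output.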

For more general monitoring (i.e., compute and memory usage), I've used
Ganglia with the NVML plugins. I'm not sure how well maintained these are,
though.

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
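
For a sense of what NVML-based metrics look like, here is a minimal sketch that takes one sample per GPU through the pynvml bindings. It is not the Ganglia module's code. It assumes the third-party pynvml package and the NVIDIA driver are installed, and the helper names and dictionary keys are invented for this example.

```python
def memory_used_percent(used_bytes, total_bytes):
    """Memory in use as a percentage of the total."""
    return 100.0 * used_bytes / total_bytes


def sample_gpus():
    """Collect one utilization/memory/temperature sample per GPU via NVML."""
    # Third-party NVML bindings, imported here so the helper above
    # can be used on machines without pynvml installed.
    import pynvml

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append({
                "gpu": i,
                "gpu_util_pct": util.gpu,  # percent of time the GPU was busy
                "mem_used_pct": memory_used_percent(mem.used, mem.total),
                "temp_c": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
            })
        return samples
    finally:
        # Always release NVML, even if a query raised.
        pynvml.nvmlShutdown()
```

The returned dictionaries could be fed to whatever collector is already in place; a monitoring plugin would typically gather the same values on a fixed interval.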

Adam

On Friday, June 5, 2015, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:

> Hi,
>
> I recently installed an NVIDIA K80 GPU in a server. Can anyone share
> methods and procedures for monitoring and ensuring the card is cooled
> sufficiently by the server fans?  I need to set this up and test before
> running any compute tests.
>
>
> Thanks,
> Kevin
>
> --
> Kevin Abbey
> Systems Administrator
> Center for Computational and Integrative Biology (CCIB)
> http://ccib.camden.rutgers.edu/
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>