Hi Kevin,<div><br></div><div>nvidia-healthmon is the tool I've used for this kind of thing in the past. It can do temperature checks as well as some sanity checks for things like PCIe connectivity.</div><div><br></div><div><a href="http://docs.nvidia.com/deploy/healthmon-user-guide/index.html">http://docs.nvidia.com/deploy/healthmon-user-guide/index.html</a></div><div><br></div><div>For more general monitoring (i.e., compute and memory usage), I've used Ganglia with the NVML plugins. I'm not sure how well maintained these are, though.</div><div><br></div><div><a href="https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia">https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia</a><br></div><div><br></div>Adam<br><div><br>On Friday, June 5, 2015, Kevin Abbey &lt;<a href="mailto:kevin.abbey@rutgers.edu">kevin.abbey@rutgers.edu</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

I recently installed an NVIDIA K80 GPU in a server. Can anyone share methods and procedures for monitoring the card and ensuring it is cooled sufficiently by the server fans?  I need to set this up and test it before running any compute tests.<br>

<br>

<br>

Thanks,<br>

Kevin<br>

<br>

-- <br>

Kevin Abbey<br>

Systems Administrator<br>

Center for Computational and Integrative Biology (CCIB)<br>

<a href="http://ccib.camden.rutgers.edu/" target="_blank">http://ccib.camden.rutgers.edu/</a><br>

 <br>

_______________________________________________<br>

Beowulf mailing list, <a>Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

</blockquote></div>
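<div><br></div><div>For a quick spot check while testing the fans, something along these lines might also help. It is only a rough sketch, not part of nvidia-healthmon or the Ganglia plugins: it assumes nvidia-smi is installed with the driver and on the PATH, and the helper names (parse_temperatures, hot_gpus, read_temperatures) and the temperature limit passed in are made up for illustration. Check the board's documentation for the real thermal limits.</div>

```python
import subprocess

# Assumption: nvidia-smi is on the PATH. These query fields ask for one CSV
# row per GPU: index, product name, and core temperature in degrees Celsius.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn 'index, name, temperature' rows into (index, name, temp_c) tuples.

    temp_c is None when the field is not a plain integer (for example when
    the driver reports that the reading is unavailable).
    """
    readings = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split the index off the front and the temperature off the back so
        # a product name containing a comma would still parse.
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        try:
            temp_c = int(temp.strip())
        except ValueError:
            temp_c = None
        readings.append((int(index.strip()), name.strip(), temp_c))
    return readings


def hot_gpus(readings, limit_c):
    """Return the readings at or above limit_c (a caller-chosen threshold)."""
    return [r for r in readings if r[2] is not None and r[2] >= limit_c]


def read_temperatures():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

<div>Calling read_temperatures() in a loop with a short sleep while a load is running, and printing whatever hot_gpus() returns for a limit you pick, gives a simple view of whether the chassis airflow keeps up.</div>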