[Beowulf] gpu+server health monitoring -- ensure system cooling

Tue Jun 9 15:20:33 PDT 2015

I can confirm that Ganglia supports Tesla K80 GPU monitoring just fine.

Regarding GPU temperatures, I'm seeing ~60C in one of NVIDIA's 
officially-certified servers for Tesla K80 (4U Supermicro 
SYS-7048GR-TR). You might not want to use Tesla K20/K40 as comparisons, 
because they had lower levels of GPU Boost (and thus might not push the 
TDP envelope as much).
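
For anyone who wants to log these readings over time, as suggested 
further down the thread, here is a minimal sketch in Python. It assumes 
nvidia-smi is on the PATH and that the driver supports the --query-gpu 
option and the field names listed (check with nvidia-smi 
--help-query-gpu). The 80C threshold and the function names are 
illustrative placeholders of mine, not NVIDIA-published values or part 
of any tool.

```python
import csv
import io
import subprocess
import time

# Field names passed to `nvidia-smi --query-gpu`; confirm them on your
# driver with `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "name",
    "temperature.gpu",
    "utilization.gpu",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]

# Hypothetical alert threshold chosen for illustration only.
TEMP_LIMIT_C = 80


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (noheader,nounits) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def find_problems(rows, temp_limit=TEMP_LIMIT_C):
    """Return warnings for GPUs at or above temp_limit or with ECC errors."""
    problems = []
    for row in rows:
        label = "GPU %s (%s)" % (row["index"], row["name"])
        temp = row["temperature.gpu"]
        if temp.isdigit() and int(temp) >= temp_limit:
            problems.append("%s at %sC" % (label, temp))
        for field in FIELDS[4:]:
            count = row[field]
            # Non-numeric values (e.g. when ECC is unavailable) are skipped.
            if count.isdigit() and int(count) > 0:
                problems.append("%s %s = %s" % (label, field, count))
    return problems


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, universal_newlines=True))


def log_forever(interval_s=60):
    """Print a timestamped reading and any warnings every interval_s seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        rows = query_gpus()
        for row in rows:
            print(stamp, row)
        for problem in find_problems(rows):
            print(stamp, "WARNING", problem)
        time.sleep(interval_s)
```

Calling log_forever() from a job prolog or a cron-started wrapper would 
give a simple temperature and ECC history per node. The parsing is kept 
separate from the nvidia-smi call so it can be checked on a machine 
without a GPU.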

Best,
Eliot

On 06/08/2015 12:07 AM, Kevin Abbey wrote:
> Thank you each for the notes.  The current host BIOS/BMC appears to 
> read data from a MIC card but not from the NVIDIA card.  I'm 
> considering finding a way to simply force an increased fan speed in 
> the server for jobs using the GPU.  I'll also ask Intel again if they 
> can help, perhaps with a custom SDR file.  I assume they have done 
> this on their current generation of hardware, which would hopefully be 
> portable to a Sandy Bridge board.
>
>
> Are there published average running temperatures for the K20, K40, 
> and K80 GPUs?
>
> nvidia-smi reported 66C during a few test jobs.  This is below the 
> power throttle temperature on the GPU, but the utilization was still 
> below 75%.
>
> Thanks, I'll check for the ECC errors too.
> Kevin
>
>
> On 6/7/2015 9:14 PM, Paul McIntosh wrote:
>> We use nvidia-smi also.
>>
>> You should also keep an eye out for GPU ECC errors as we have found 
>> these are good predictors of bad things happening due to heat. 
>> Generally you should see none.
>>
>> In the past we had major issues with the node heat sensors being 
>> designed around detecting CPU heat and not the GPUs living in the 
>> same box. A firmware upgrade fixed the issue but the ECC checks were 
>> the thing that best found the problem nodes.
>>
>> Cheers,
>>
>> Paul
>>
>>
>> ----- Original Message -----
>> From: "Michael Di Domenico" <mdidomenico4 at gmail.com>
>> To: "Beowulf Mailing List" <Beowulf at beowulf.org>
>> Sent: Monday, 8 June, 2015 7:50:40 AM
>> Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system 
>> cooling
>>
>> nvidia-smi will also show the current temperature of the card. You
>> could script it to save the results over time.  It even includes XML
>> output if you're savvy at parsing it.
>>
>> On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajdecon at ajdecon.org> 
>> wrote:
>>> Hi Kevin,
>>>
>>> nvidia-healthmon is the tool I've used for this kind of thing in the
>>> past. It can do temperature checks as well as some sanity checks for
>>> things like PCIe connectivity.
>>>
>>> http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
>>>
>>> For more general monitoring (i.e. compute and memory usage), I've used
>>> Ganglia with the NVML plugins. Not sure how well maintained these are,
>>> though.
>>>
>>> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>>>
>>> Adam
>>>
>>>
>>> On Friday, June 5, 2015, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>> Hi,
>>>>
>>>> I recently installed an NVIDIA K80 GPU in a server. Can anyone share
>>>> methods and procedures for monitoring and ensuring the card is cooled
>>>> sufficiently by the server fans?  I need to set this up and test
>>>> before running any compute tests.
>>>>
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> -- 
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Center for Computational and Integrative Biology (CCIB)
>>>> http://ccib.camden.rutgers.edu/

-- 
Eliot Eshelman
Microway, Inc.