[Beowulf] gpu+server health monitoring -- ensure system cooling
Eliot Eshelman
eliote at microway.com
Tue Jun 9 15:20:33 PDT 2015
I can confirm that Ganglia supports Tesla K80 GPU monitoring just fine.
Regarding GPU temperatures, I'm seeing ~60C on a Tesla K80 in one of
NVIDIA's officially certified servers (a 4U Supermicro SYS-7048GR-TR).
You might not want to use the Tesla K20/K40 as comparisons, because
they had lower levels of GPU Boost (and thus might not push the TDP
envelope as much).
Best,
Eliot
On 06/08/2015 12:07 AM, Kevin Abbey wrote:
> Thank you each for the notes. The current host BIOS/BMC appears to
> read data from a MIC card but not from the NVIDIA card. I'm considering
> finding a way to simply force an increased fan speed in the server for
> jobs using the GPU. I'll also ask Intel again if they can help, perhaps
> with a custom SDR file. I assume they have done this on their current
> generation of hardware, which would hopefully be portable to a
> Sandy Bridge board.
>
>
> Are there published average running temperatures for the K20, K40, and K80 GPUs?
>
> nvidia-smi reported 66C during a few test jobs. This is below the
> power-throttle temperature of the GPU, but utilization was still
> below 75%.
>
> Thanks, I'll check for the ECC errors too.
> Kevin
>
>
> On 6/7/2015 9:14 PM, Paul McIntosh wrote:
>> We also use nvidia-smi.
>>
>> You should also keep an eye out for GPU ECC errors as we have found
>> these are good predictors of bad things happening due to heat.
>> Generally you should see none.
>>
>> In the past we had major issues with node heat sensors that were
>> designed around detecting CPU heat, not that of the GPUs living in the
>> same box. A firmware upgrade fixed the issue, but the ECC checks were
>> what best identified the problem nodes.
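A minimal sketch of that "expect zero" ECC check. The `ecc.errors.*`
field name used below is from nvidia-smi's `--query-gpu` interface and
may vary by driver version, so treat it as an assumption:

```python
# Hypothetical ECC sanity check: flag any GPU reporting nonzero
# aggregate ECC errors. Assumes nvidia-smi is on the PATH and that the
# query field name below is valid for the installed driver.
import subprocess

def nonzero_ecc(csv_text):
    """Return GPU indices with a nonzero ECC error count.

    Expects CSV rows of 'index, count'. Rows with non-numeric counts
    (e.g. '[N/A]' on cards without ECC) are skipped.
    """
    bad = []
    for line in csv_text.strip().splitlines():
        idx, count = [field.strip() for field in line.split(",")]
        if count.isdigit() and int(count) > 0:
            bad.append(int(idx))
    return bad

def check_ecc():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,ecc.errors.uncorrected.aggregate.total",
         "--format=csv,noheader,nounits"],
        text=True)
    return nonzero_ecc(out)
```

Any index returned by `check_ecc()` would be a node worth pulling
aside for a closer look, per Paul's "generally you should see none".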
>>
>> Cheers,
>>
>> Paul
>>
>>
>> ----- Original Message -----
>> From: "Michael Di Domenico" <mdidomenico4 at gmail.com>
>> To: "Beowulf Mailing List" <Beowulf at beowulf.org>
>> Sent: Monday, 8 June, 2015 7:50:40 AM
>> Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system
>> cooling
>>
>> nvidia-smi will also show the current temperature of the card. You
>> could script it to save the results over time. It even offers XML
>> output if you're savvy at parsing it.
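A minimal polling sketch along those lines, using nvidia-smi's CSV
query interface rather than the XML output. The `--query-gpu` field
names and the one-minute interval are assumptions, not anything from
the thread:

```python
# Hypothetical temperature logger: print each GPU's temperature once a
# minute. Assumes nvidia-smi is on the PATH.
import subprocess
import time

def parse_temps(csv_text):
    """Parse 'index, temperature.gpu' CSV rows into {gpu_index: temp_C}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        idx, temp = line.split(",")
        temps[int(idx)] = int(temp)
    return temps

def poll_once():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_temps(out)

if __name__ == "__main__":
    while True:
        print(time.strftime("%F %T"), poll_once())
        time.sleep(60)
```

Redirecting the output to a file would give a simple temperature
history to review after a burn-in run.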
>>
>> On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajdecon at ajdecon.org>
>> wrote:
>>> Hi Kevin,
>>>
>>> nvidia-healthmon is the tool I've used for this kind of thing in the
>>> past.
>>> It can do temperature checks as well as some sanity checks for
>>> things like
>>> PCIe connectivity.
>>>
>>> http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
>>>
>>> For more general monitoring (e.g. compute and memory usage), I've used
>>> Ganglia with the NVML plugins. Not sure how well maintained these are
>>> though.
>>>
>>> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>>>
>>> Adam
>>>
>>>
>>> On Friday, June 5, 2015, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>> Hi,
>>>>
>>> I recently installed an NVIDIA K80 GPU in a server. Can anyone share
>>>> methods and procedures for monitoring and ensuring the card is cooled
>>>> sufficiently by the server fans? I need to set this up and test
>>>> before
>>>> running any compute tests.
>>>>
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> --
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Center for Computational and Integrative Biology (CCIB)
>>>> http://ccib.camden.rutgers.edu/
--
Eliot Eshelman
Microway, Inc.