[Beowulf] Problems with Dell M620 and CPU power throttling

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Fri Aug 30 10:54:02 PDT 2013


Have you tried measuring the AC power with a Kill-A-Watt type device (or from your power distribution unit)?  That's a lot easier than trying to measure the DC power going to the processor, and it might tell you something useful diagnostics-wise: for instance, whether the offending unit shows a different AC power draw just before it hiccups.
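
If the iDRAC/CMC happens to support DCMI power readings, you might also be able to get a coarse power trace without any extra hardware.  Something like this (untested sketch; assumes the BMC actually exposes DCMI, and <bmc-address>/<user> are whatever fits your setup):

ipmitool -I lanplus -H <bmc-address> -U <user> dcmi power reading

Polling that once a second around the time a node hiccups, and comparing against a known-good node, would be a cheap first check before breaking out the clamp meter.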

Jim Lux

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Bill Wichser
Sent: Friday, August 30, 2013 9:20 AM
To: Mark Hahn
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Problems with Dell M620 and CPU power throttling



On 08/30/2013 12:00 PM, Mark Hahn wrote:
>> Of course we have done system tuning.
>
> sorry for the unintentional condescension - what I actually meant was
> "tuning of knobs located in /sys" :)
>
>> Instrumenting temperature probes on individual CPUs has not been 
>> performed. When we look at temperatures from both the chassis and 
>> ipmitool, we see no drastic peaks.  Maybe we are getting a 60C peak 
>> that we don't detect and that is the cause.  But I doubt it.
>
> could you try "modprobe coretemp", and see whether interesting things 
> appear under:
> /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count
>
> afaik, reading the coretemp*/temp*_input values would let you do 
> higher-resolution monitoring to see whether you're getting spikes.

We have these already loaded and see values:

[root at r2c3n4 thermal_throttle]# ls
core_power_limit_count  core_throttle_count  package_power_limit_count package_throttle_count
[root at r2c3n4 thermal_throttle]# cat *
18781048
0
18781097
0

This was what led us to discover how the chassis was limiting power.  We had been running the power supplies in redundant mode and switched to non-redundant to try to eliminate the problem.  We believe that we see these messages when the CPU is throttling as it ramps up in power.  According to the google oracle these messages are benign.  Perhaps that isn't so...
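
In case it helps, here is a rough sketch of what we could run to correlate those counters with any short temperature spikes (untested; assumes coretemp is loaded and its per-package directories show up under /sys/devices/platform/coretemp.* on our kernel):

# poll power-limit/throttle counters and core temps once a second
while true; do
    date +%s
    cat /sys/devices/system/cpu/cpu*/thermal_throttle/*_count
    cat /sys/devices/platform/coretemp.*/temp*_input 2>/dev/null
    sleep 1
done

Logging that on a node while a job runs would at least tell us whether the jumps in the power-limit counters line up with any temperature excursion that ipmitool misses.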

>
>> power consumption is around 80W.  That tells me that the system is 
>> cool enough.  Should I not believe those values?  I have no reason to 
>> doubt them from past experience.
>
> I'm not casting aspersions, just that chassis temps don't tell the 
> whole story.  is your exact model of CPU actually rated for higher power?
> we've got some ProLiant SL230s Gen8 with E5-2680's - rated for 130W, 
> and don't seem to be throttling.

These are E5-2670 0 @ 2.60GHz (115W TDP), two per node.
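
If it's useful, a quick way to check whether cores are actually sitting below the nominal 2.6 GHz during one of these episodes is just to histogram the clocks out of /proc/cpuinfo (nothing fancy, works on any of these nodes):

# count how many cores are at each current clock speed
awk '/cpu MHz/ {print $4}' /proc/cpuinfo | sort -n | uniq -c

If a node is being power-capped we'd expect to see clocks well below 2600 there while the counters above climb.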

>
>> Input air is about 22C.  For our data center, you'd have a better 
>> chance of getting this adjusted to 15C than I would!  As for fans, 
>> these don't have
>
> yes, well, it is nice to have one's own datacenter ;) but seriously, I 
> find it sometimes makes a difference to open front and back doors of 
> the rack (if any), do some manual sampling of air flow and 
> temperatures (wave hand around)...
>
>> For heat sink thermal grease problems, I'd expect this to be visible 
>> using the ipmitools but maybe that is not where the temperatures are 
>> being measured.  I don't know about that issue.  I'd expect that a 
>> bad thermal grease issue would manifest itself by showing up on a per 
>> socket level and not on both sockets.  It seems odd that every node 
>> exhibiting this problem would have both sockets having the same issue.
>
> well, if both sockets have poor thermal contact with heatsinks...
> I'm not trying to FUD up any particular vendor(s), but mistakes do happen.
> I was imagining, for instance, that an assembly line might be set up 
> with HS and thermal compound tuned for E5-2637 systems (80W/socket), 
> but was pressed into service for some E5-2690 nodes (135W).

I'd expect the bad nodes to be consistently bad.  Instead they have mostly been moving targets at this point, randomly distributed.

>
>> Again, the magnitude of the problem is about 5-10% at any time.
>> Given 600
>
> if I understand you, the prevalence is only 5-10%, but the magnitude
> (effect)
> is much larger, right?

Right.  We have many jobs which use 32 nodes or more.  Any time a node goes bad, the whole job slows to a crawl, tying up resources for days instead of hours.

Bill

>
> regards, mark.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org, sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


