[Beowulf] Problems with Dell M620 and CPU power throttling
Douglas O'Flaherty
douglasof at gmail.com
Tue Sep 17 17:49:00 PDT 2013
I recently received a hallway tip from a performance engineer at a major SW company:
"Run in C1. C0 over commits unpredictably, then throttles."
He didn't specify a hardware platform.
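For what it's worth, that C1 tip can be applied on a running Linux node through the cpuidle sysfs interface. A minimal sketch, assuming the standard intel_idle sysfs layout (state0/state1 are typically POLL and C1; check the "name" files before trusting the indices on your kernel):

```shell
# Disable every cpuidle state deeper than C1 on all cores (needs root).
# Verify state names first: cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
for d in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]; do
    echo 1 > "$d/disable"
done
# Persistent alternative: boot with intel_idle.max_cstate=1 on the
# kernel command line.
```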
AFK: mobile
On Sep 17, 2013, at 8:06 PM, Richard Hickey <rahickey at nps.edu> wrote:
> Bill Wichser <bill <at> princeton.edu> writes:
>
>> Since January, when we installed an M620 Sandy Bridge cluster from Dell,
>> we have had issues with power and performance to compute nodes. Dell
>> apparently continues to look into the problem, but the usual responses
>> have provided no solution. Firmware, BIOS, and OS updates have all been
>> fruitless.
>>
>> The problem is that the node/CPU is power capped. We first detected
>> this with the STREAM benchmark, a quick run, which shows memory
>> bandwidth around 2000 MB/s instead of the normal 13000 MB/s. When the
>> CPU is in the C0 state, this drops to around 600.
>>
>> The effect appears randomly across the entire cluster, with 5-10% of the
>> nodes demonstrating slower performance. We don't know what triggers it.
>> Using "turbostat" we can see that the GHz of the cores is >= 1 in most
>> cases, dropping to about 0.2 in some of the worst cases. Looking at the
>> power consumption via either the chassis GUI or "ipmitool sdr list",
>> we see only about 80 watts being used.
>>
>> We run the RH 6.x release and are up to date with kernel/OS patches.
>> All firmware is up to date. Chassis power is configured as
>> non-redundant. tuned is set for performance. Turbo mode is
>> on/hyperthreading is off/performance mode is set in the BIOS.
>>
>> A reboot does not clear this problem, but a power cycle returns the
>> compute node to normal. Again, we do not know what triggers this
>> event. We are not overheating the nodes, but while applications are
>> running, something triggers an event where this power capping takes
>> effect.
>>
>> At this point we remain clueless about what is causing this to happen.
>> We can detect the condition now and have been power cycling the nodes
>> in order to reset them.
>>
>> If anyone has a clue, or better yet, has solved the issue, we'd love
>> to hear the solution!
>>
>> Thanks,
>> Bill
>
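Bill's ~80 W observation suggests a simple automated health check. A sketch of one, as a filter over "ipmitool sdr list"-style output; the sensor name ("Pwr Consumption") and the 120 W threshold are assumptions based on the symptoms described above, so adjust both for your chassis:

```shell
# check_capped: read "ipmitool sdr list"-style text on stdin and warn
# when the system power reading is implausibly low for a loaded node.
check_capped() {
    awk -F'|' '
        /Pwr Consumption/ { w = $2 + 0 }
        END { if (w > 0 && w < 120) print "possible power cap: " w " W" }
    '
}
# On a live node: ipmitool sdr list | check_capped
```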
> You are not alone in seeing this. We discovered it by some of our
> weather codes running slow. A co-worker started running single-node
> Linpack runs and we saw individual nodes running slow. A reboot did
> not fix it, but a power cycle did. We can see a 2- to 3-fold increase
> in performance after the cycle.
>
> We found that you could do either a physical reseat of the blade or a
> logical one through the CMC command line. Either way fixes the problem
> temporarily.
>
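For anyone hunting for the logical-reseat route Rich mentions: on an M1000e chassis this goes through the CMC's racadm interface. The exact subcommand below is an assumption from memory of later CMC firmware (and the slot name "server-3" is just an example), so confirm it with "racadm help serveraction" on your own chassis first:

```shell
# Virtual reseat of the blade in slot 3, issued from the CMC CLI.
# Equivalent to physically pulling and re-inserting the blade, without
# a trip to the machine room.
racadm serveraction reseat -m server-3 -f
```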
> It's good to see that someone else is seeing this. Well, maybe not
> good, but at least we're not the only ones fighting it.
>
> Rich
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf