[Beowulf] Problems with Dell M620 and CPU power throttling
Douglas O'Flaherty
douglasof at gmail.com
Tue Sep 17 17:49:00 PDT 2013
I recently received a hallway tip from a performance engineer at a major SW company:
"Run in C1. C0 over commits unpredictably, then throttles."
He didn't specify a hardware platform.
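For what it's worth, that C1 tip can be applied on a running Linux node through the cpuidle sysfs interface. A minimal sketch, assuming the standard intel_idle sysfs layout (state0/state1 are typically POLL and C1; check the "name" files before trusting the indices on your kernel):

```shell
# Disable every cpuidle state deeper than C1 on all cores (needs root).
# Verify state names first: cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
for d in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]; do
    echo 1 > "$d/disable"
done
# Persistent alternative: boot with intel_idle.max_cstate=1 on the
# kernel command line.
```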
AFK: mobile
On Sep 17, 2013, at 8:06 PM, Richard Hickey <rahickey at nps.edu> wrote:
> Bill Wichser <bill <at> princeton.edu> writes:
>
>> Since January, when we installed an M620 Sandy Bridge cluster from Dell,
>> we have had issues with power and performance to compute nodes. Dell
>> apparently continues to look into the problem, but the usual responses
>> have provided no solution. Firmware, BIOS, and OS updates have all been
>> fruitless.
>>
>> The problem is that the node/CPU is power capped. We first detected
>> this with the STREAM benchmark, a quick run, which shows memory
>> bandwidth around 2000 MB/s instead of the normal 13000 MB/s. When the
>> CPU is in the C0 state, this drops to around 600.
>>
>> The effect appears randomly across the entire cluster, with 5-10% of the
>> nodes demonstrating slower performance. We don't know what triggers it.
>> Using "turbostat" we can see that the GHz of the cores is >= 1 in most
>> cases, dropping to about 0.2 in some of the worst cases. Looking at the
>> power consumption via either the chassis GUI or "ipmitool sdr list",
>> we see only about 80 watts being used.
>>
>> We run the RH 6.x release and are up to date with kernel/OS patches.
>> All firmware is up to date. Chassis power is configured as
>> non-redundant. tuned is set for performance. Turbo mode is
>> on/hyperthreading is off/performance mode is set in the BIOS.
>>
>> A reboot does not clear this problem, but a power cycle returns the
>> compute node to normal. Again, we do not know what triggers this
>> event. We are not overheating the nodes, but while applications are
>> running, something triggers an event where this power capping takes
>> effect.
>>
>> At this point we remain clueless about what is causing this to happen.
>> We can detect the condition now and have been power cycling the nodes
>> in order to reset them.
>>
>> If anyone has a clue, or better yet, has solved the issue, we'd love
>> to hear the solution!
>>
>> Thanks,
>> Bill
>
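Bill's ~80 W observation suggests a simple automated health check. A sketch of one, as a filter over "ipmitool sdr list"-style output; the sensor name ("Pwr Consumption") and the 120 W threshold are assumptions based on the symptoms described above, so adjust both for your chassis:

```shell
# check_capped: read "ipmitool sdr list"-style text on stdin and warn
# when the system power reading is implausibly low for a loaded node.
check_capped() {
    awk -F'|' '
        /Pwr Consumption/ { w = $2 + 0 }
        END { if (w > 0 && w < 120) print "possible power cap: " w " W" }
    '
}
# On a live node: ipmitool sdr list | check_capped
```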
> You are not alone in seeing this. We discovered it by some of our
> weather codes running slow. A co-worker started running single-node
> Linpack runs and we saw individual nodes running slow. A reboot did
> not fix it, but a power cycle did. We can see a 2- to 3-fold increase
> in performance after the cycle.
>
> We found that you could do either a physical reseat of the blade or a
> logical one through the CMC command line. Either way fixes the problem
> temporarily.
>
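For anyone hunting for the logical-reseat route Rich mentions: on an M1000e chassis this goes through the CMC's racadm interface. The exact subcommand below is an assumption from memory of later CMC firmware (and the slot name "server-3" is just an example), so confirm it with "racadm help serveraction" on your own chassis first:

```shell
# Virtual reseat of the blade in slot 3, issued from the CMC CLI.
# Equivalent to physically pulling and re-inserting the blade, without
# a trip to the machine room.
racadm serveraction reseat -m server-3 -f
```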
> It's good to see that someone else is seeing this. Well, maybe not
> good, but at least we're not the only ones fighting it.
>
> Rich
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf