[Beowulf] Problems with Dell M620 and CPU power throttling

Thu Sep 5 05:51:56 PDT 2013

Sigh.  I'm back with the same issues.

We moved power management entirely into the OS by changing BIOS and
chassis settings, so we can now see more information using the OS ACPI
commands.  This isn't a state issue.  We added the kernel parameters
intel_idle.max_cstate=0 processor.max_cstate=1 to eliminate the Intel
throttles, turning control over to the OS.  Still the same issues.

turbostat shows core speeds at >1.2 GHz.  i7z shows that temps are below
60C.  When the cores drift below the 2.6 GHz they should run at, we try
to raise the frequency with the cpupower command or by writing directly
under /sys/devices/system/cpu/cpuX/cpufreq.  The setting is accepted,
but the actual frequency never changes.
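
For anyone who wants to check the same thing on their own nodes, here
is a minimal sketch (not the procedure used above) that compares the
policy ceiling with the hardware ceiling for each core.  It assumes the
standard Linux cpufreq sysfs files cpuinfo_max_freq and
scaling_max_freq, both in kHz; the function names are made up for
illustration.

```python
import glob
import os

SYSFS_CPU = "/sys/devices/system/cpu"  # standard Linux sysfs location


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def capped_cpus(root=SYSFS_CPU):
    """Return {cpu: (policy_max_khz, hw_max_khz)} for cores whose
    policy ceiling is below the hardware ceiling."""
    capped = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq")
    for cpufreq_dir in sorted(glob.glob(pattern)):
        hw_max = read_khz(os.path.join(cpufreq_dir, "cpuinfo_max_freq"))
        policy_max = read_khz(os.path.join(cpufreq_dir, "scaling_max_freq"))
        if policy_max < hw_max:
            cpu = os.path.basename(os.path.dirname(cpufreq_dir))
            capped[cpu] = (policy_max, hw_max)
    return capped
```

Calling capped_cpus() with no arguments reads the live system.  An
empty result means no core's policy limit sits below its hardware
limit, which does not by itself rule out throttling applied elsewhere.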

cpupower  frequency-info

shows, initially,
current policy: frequency should be within 1.20 GHz and 2.60 GHz
current CPU frequency is 2.60 GHz (asserted by call to hardware).

Starting an HPL run immediately takes this to
  current policy: frequency should be within 1.20 GHz and 2.20 GHz.
  current CPU frequency is 1.60 GHz (asserted by call to hardware).
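
The before/after comparison can be automated.  The sketch below parses
text in the shape quoted above and reports whether the policy ceiling
has fallen below an expected value.  It assumes the two lines look
exactly like the ones quoted in this message; the function names and
the expected-maximum parameter are hypothetical.

```python
import re

UNIT_TO_GHZ = {"GHz": 1.0, "MHz": 1e-3, "kHz": 1e-6}

# Patterns modelled on the two lines quoted in this message.
POLICY_RE = re.compile(
    r"frequency should be within ([\d.]+) ([kMG]Hz) and ([\d.]+) ([kMG]Hz)")
CURRENT_RE = re.compile(r"current CPU frequency is ([\d.]+) ([kMG]Hz)")


def parse_frequency_info(text):
    """Extract policy min/max and current frequency, all in GHz."""
    policy = POLICY_RE.search(text)
    current = CURRENT_RE.search(text)
    if policy is None or current is None:
        raise ValueError("unrecognized frequency-info text")
    return {
        "policy_min_ghz": float(policy.group(1)) * UNIT_TO_GHZ[policy.group(2)],
        "policy_max_ghz": float(policy.group(3)) * UNIT_TO_GHZ[policy.group(4)],
        "current_ghz": float(current.group(1)) * UNIT_TO_GHZ[current.group(2)],
    }


def policy_lowered(text, expected_max_ghz):
    """True if the reported policy ceiling is below expected_max_ghz."""
    return parse_frequency_info(text)["policy_max_ghz"] < expected_max_ghz
```

Fed the two excerpts above with an expected maximum of 2.6, the first
is reported as not lowered and the second as lowered.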

This is a "good" node.  Others drop to about 0.2 GHz.  Again, i7z shows
temps in the mid-50C range.

ipmitool sdr shows
Pwr Consumption  | 192 Watts         | ok
Current          | 0.80 Amps         | ok

On a normal node the power is upwards of 350W.
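
A rough way to flag affected nodes from this reading is sketched below.
It parses text in the pipe-separated shape shown above and compares the
"Pwr Consumption" value against a threshold.  The 300 W default is
purely an assumption for illustration (somewhere between the 192 W and
350 W figures above) and only makes sense while the node is under load;
the function names are hypothetical.

```python
def sensor_watts(sdr_text, sensor="Pwr Consumption"):
    """Return the reading in watts for `sensor`, or None if the sensor
    is missing or its value is not reported in Watts."""
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0] == sensor:
            parts = fields[1].split()
            if len(parts) == 2 and parts[1] == "Watts":
                return float(parts[0])
    return None


def looks_power_capped(sdr_text, min_loaded_watts=300.0):
    """True if a loaded node draws less than min_loaded_watts.
    The default threshold is an illustrative assumption."""
    watts = sensor_watts(sdr_text)
    return watts is not None and watts < min_loaded_watts
```

The input would be the captured output of "ipmitool sdr" on the node,
and the threshold would need tuning for the hardware and workload.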

We are trying to escalate with Dell but that process is SLOW!

Thanks,
Bill

On 08/30/2013 09:03 AM, Bill Wichser wrote:
> Since January, when we installed an M620 Sandy Bridge cluster from Dell,
> we have had issues with power and performance to compute nodes.  Dell
> apparently continues to look into the problem but the usual responses
> have provided no solution.  Firmware, BIOS, OS updates all are fruitless.
>
> The problem is that the node/CPU is power capped.  We first detected
> this with the STREAM benchmark, a quick run, which shows memory
> bandwidth around 2000 instead of the normal 13000 MB/s.  When the CPU is
> in the C0 state, this drops to around 600.
>
> The effect appears randomly across the entire cluster with 5-10% of the
> nodes demonstrating some slower performance.  We don't know what
> triggers this.  Using "turbostat" we can see that the GHz of the cores
> is >= 1 in most cases, dropping to about 0.2 in some of the worst cases.
> Looking at the power consumption by either the chassis GUI or using
> "ipmitool sdr list" we see that there is only about 80 watts being used.
>
> We run the RH 6.x release and are up to date with kernel/OS patches.
> All firmware is up to date.  Chassis power is configured as
> non-redundant.  tuned is set for performance.  In the BIOS, turbo mode
> is on, hyperthreading is off, and performance mode is set.
>
> A reboot does not change this problem.  But a power cycle returns the
> compute node to normal again.  Again, we do not know what triggers this
> event.  We are not overheating the nodes.  But while applications are
> running, something triggers an event where this power capping takes effect.
>
> At this point we remain clueless about what is causing this to happen.
> We can detect the condition now and have been power cycling the nodes in
> order to reset.
>
> If anyone has a clue or, better yet, has solved the issue, we'd love to
> hear the solution!
>
> Thanks,
> Bill