[Beowulf] Problems with Dell M620 and CPU power throttling

Bill Wichser bill at princeton.edu
Fri Aug 30 06:03:18 PDT 2013


Since January, when we installed an M620 Sandy Bridge cluster from Dell,
we have had issues with power and performance on the compute nodes.  Dell
apparently continues to look into the problem, but the usual responses
have provided no solution.  Firmware, BIOS, and OS updates have all been
fruitless.

The problem is that the node/CPU is power capped.  We first detected
this with a quick run of the STREAM benchmark, which shows memory
bandwidth around 2000 MB/s instead of the normal 13000 MB/s.  When the
CPU is in the C0 state, this drops to around 600 MB/s.

The effect appears randomly across the entire cluster, with 5-10% of the
nodes showing some degree of slowdown.  We don't know what triggers
this.  Using "turbostat" we can see that the core clock is >= 1 GHz in
most cases, dropping to about 0.2 GHz in some of the worst cases.
Looking at the power consumption, either in the chassis GUI or with
"ipmitool sdr list", we see that only about 80 watts is being used.

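For anyone who wants to pull the wattage out of "ipmitool sdr list" across
many nodes, here is a minimal Python sketch.  It assumes the usual
pipe-separated "name | reading | status" layout; the sensor names and
the sample text are hypothetical, not copied from an M620.

```python
import re


def parse_sdr_watts(sdr_text):
    """Return {sensor_name: watts} for every sensor whose reading is in Watts."""
    readings = {}
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue  # not a "name | reading | status" line
        match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*Watts", fields[1])
        if match:
            readings[fields[0]] = float(match.group(1))
    return readings


# Hypothetical sample text; real sensor names differ between platforms.
SAMPLE = """\
Pwr Consumption  | 80 Watts          | ok
Temp             | 45 degrees C      | ok
Fan1             | no reading        | ns
"""

if __name__ == "__main__":
    print(parse_sdr_watts(SAMPLE))
```
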
We run the RH 6.x release and are up to date with kernel/OS patches.
All firmware is up to date.  Chassis power is configured as
non-redundant.  tuned is set for performance.  In the BIOS, turbo mode
is on, hyperthreading is off, and performance mode is set.

A reboot does not clear the problem, but a power cycle returns the
compute node to normal.  Again, we do not know what triggers this
event.  We are not overheating the nodes.  But while applications are
running, something triggers an event where this power capping takes effect.

At this point we remain clueless about what is causing this to happen.
We can detect the condition now and have been power cycling the
affected nodes to reset them.
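
A minimal Python sketch of one way to flag affected nodes from STREAM
results, not necessarily the check used here.  The 13000 MB/s baseline
is the healthy figure quoted above; the 50% cutoff, the node names, and
the function name are assumptions made for illustration.

```python
NORMAL_STREAM_MB_S = 13000.0  # healthy bandwidth quoted in this post
THRESHOLD_FRACTION = 0.5      # assumption: flag anything under half of normal


def throttled_nodes(bandwidth_by_node,
                    normal=NORMAL_STREAM_MB_S,
                    fraction=THRESHOLD_FRACTION):
    """Return the sorted names of nodes whose STREAM bandwidth is below the cutoff."""
    cutoff = normal * fraction
    return sorted(
        node for node, mb_s in bandwidth_by_node.items() if mb_s < cutoff
    )


if __name__ == "__main__":
    # Hypothetical node names with bandwidths in MB/s.
    sample = {"node001": 13100.0, "node002": 2000.0, "node003": 600.0}
    for node in throttled_nodes(sample):
        print(node, "looks power capped; candidate for a power cycle")
```

The cutoff is deliberately coarse: the healthy and capped figures are far
enough apart that any value between them separates the two groups.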

If anyone has a clue, or better yet, solved the issue, we'd love to hear 
the solution!

Thanks,
Bill

