[Beowulf] UPS & power supply instability
kewley at gps.caltech.edu
Tue Sep 27 17:32:18 PDT 2005
I wonder whether anyone has seen the problem that we're seeing with our
cluster's electrical supply.
We have a Liebert 600 Series 500kVA UPS feeding two Liebert PDUs. The
PDUs then have a fanout of whips to the computer racks.
The UPS voltage in/out and the PDU voltage in is 480V 3ph. The PDU out
is 208V 3ph +neutral, 120V wrt neutral. The whips are 5-conductor
(3ph, ground, neutral), and they feed APC AP7960 switchable rack PDUs.
The computers are fed 120V from the AP7960s.
Our compute nodes, the main power load, are Dell PowerEdge 1850s with a
single power supply per node. This power supply is
power-factor-corrected, so the Liebert PDUs see a power factor of 0.99.
I've balanced the loads on the three phases about as well as possible.
We still have neutral current, about 1/3 to 1/2 the magnitude of any of
the per-phase currents.
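
As a sanity check on that neutral reading: at the fundamental frequency, the neutral current in a 4-wire wye system is the phasor sum of the three phase currents, so perfectly balanced phases would cancel to roughly zero. The sketch below is a minimal illustration of that sum in Python. The function name and the example amperages are hypothetical (they are not measurements from this cluster), and it assumes each phase current sits at its nominal 0/-120/+120 degree angle. It ignores harmonics, which is worth keeping in mind because triplen harmonics add in the neutral rather than cancel.

```python
import cmath
import math


def neutral_current(i_a, i_b, i_c):
    """Fundamental-frequency neutral current magnitude (amps) for a 4-wire wye.

    Inputs are RMS phase-current magnitudes. Assumption: each current is at
    its nominal phase angle (0, -120, +120 degrees), i.e. all three phases
    have the same power factor. Harmonic currents are not modelled.
    """
    angles_deg = (0.0, -120.0, 120.0)
    total = sum(
        cmath.rect(mag, math.radians(ang))
        for mag, ang in zip((i_a, i_b, i_c), angles_deg)
    )
    return abs(total)


# Hypothetical example values, for illustration only.
balanced = neutral_current(100.0, 100.0, 100.0)    # cancels to ~0 A
unbalanced = neutral_current(100.0, 100.0, 60.0)   # leaves ~40 A in the neutral
```

Under these assumptions, a neutral current of 1/3 to 1/2 of a phase current would need either a substantial magnitude imbalance or a contribution that this simple sum does not capture, such as harmonic content or unequal phase angles.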
The problem is this: We can fire up our cluster to about 40% of maximum
load and everything is fine. But if we go over some threshold right
around 40% of max, the output currents from the PDUs go unstable. It's
a fairly sharp edge: Approximately speaking, if I stay below the
threshold, the current variation is <1%. But if I go to the top end of
the stable range, then add another ~2% load, the output currents vary
over something like 30%. The instability gets worse with increasing
load above the threshold. Reducing the load below the threshold
restores stability (with perhaps a slight bit of hysteresis).
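
To pin down the threshold more precisely, one option is to log PDU output current at several load levels and compute a variation figure for each. The sketch below is one hedged way to do that in Python. The definition of "variation" used here (peak-to-peak spread as a percentage of the mean) is an assumption, since the figures above are only approximate, and the function names, the 5% limit, and the sample readings are all hypothetical.

```python
def percent_variation(samples):
    """Peak-to-peak spread of current samples as a percentage of their mean.

    Assumption: this is only one possible definition of "variation".
    """
    mean = sum(samples) / len(samples)
    return 100.0 * (max(samples) - min(samples)) / mean


def first_unstable_load(readings, limit_pct=5.0):
    """Return the lowest load fraction whose variation exceeds limit_pct.

    readings: list of (load_fraction, [current samples]) pairs, sorted by
    increasing load. Returns None if every level stays within the limit.
    """
    for load, samples in readings:
        if percent_variation(samples) > limit_pct:
            return load
    return None


# Hypothetical readings, for illustration only (amps at each load fraction).
readings = [
    (0.38, [100.0, 100.5, 99.8]),   # quiet: well under 1% variation
    (0.42, [85.0, 100.0, 115.0]),   # unstable: about 30% variation
]
threshold = first_unstable_load(readings)
```

Running the same sweep downward from above the threshold would show whether the return to stability happens at a slightly lower load than the onset.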
This instability only happens when the UPS is online. If we put the UPS
in bypass, we can go up to around 70% of max load with no instability
(all computers on but idling in the OS; we haven't tested all nodes at
100% CPU yet).
We suspect the problem is due to some interaction between the computer
power supplies and the output stage of the UPS. Perhaps the UPS isn't
regulating correctly with this load. Or perhaps it's regulating *too
well*, and the rock-solid voltages allow the oscillations to grow
instead of damp. I don't know.
Liebert has been on this case for something like 4 weeks now. Mind
you, the "blame" may be shared by the Liebert UPS and the Dell power
supplies, but I'm relying on Liebert to figure out why things go
unstable *when their UPS is online, supplying a load that should be
quite normal*, and so far they have no solution for me. We can't just
wait on Liebert; this problem is hamstringing our use of our new
1024-node cluster. So now I turn to this list.
Can anyone here offer ideas, or better yet, experience?