[Beowulf] UPS & power supply instability
kewley at gps.caltech.edu
Thu Sep 29 09:46:39 PDT 2005
On Thursday 29 September 2005 07:46, Mark Hahn wrote:
> > We have a Liebert 600 Series 500kVA UPS feeding two Liebert PDUs. The
> > PDUs then have a fanout of whips to the computer racks.
> the PDU's are just simple networks, not matching transformers or
> harmonic mitigators? (for our new ~2k cpus machineroom, the local
> physical plant people required us to put in Liebert harmonic mitigators,
> even though we told them all PS's would be PFC. from HP, if that
I'm afraid I don't know that level of detail (yet :). I believe the Liebert
folks said yesterday that our PDUs have surge suppression circuitry, but no
other filter caps. And I believe the transformer is a simple isolation &
stepdown, not filtering. Certainly there are no caps except associated
with the surge suppressor.
Someone else on this thread mentioned K20 transformers, which are designed
to tolerate relatively high harmonics. These PDU transformers are indeed
K20, and if I recall correctly, the THD measured by the PDUs is well under
the K20 limit. Certainly the PDUs haven't alarmed on THD, nor have the
onsite techs raised any issue with the THD.
> > The UPS voltage in/out and the PDU voltage in is 480V 3ph. The PDU out
> > is 208V 3ph +neutral, 120V wrt neutral. The whips are 5-conductor
> > (3ph, ground, neutral), and they feed APC AP7960 switchable rack PDUs.
> > The computers are fed 120V from the AP7960s.
> it shouldn't be relevant, but did you choose against 208 to the nodes
> for a reason? (nearly everything is auto-ranging nowadays, and tends
> to run a little more efficiently at 208).
We now wish we had 208. Unfortunately, when the room was designed & the
power infrastructure built, we anticipated having an unknown mix of
equipment in the room, some of which might not handle 208 gracefully.
(Actually the "we" in the design phase didn't include me; I arrived after
the room was fully built but not yet populated with computers.)
By the way, I have no previous experience with medium or large data centers,
so I had no idea until recently that it was common to supply 208 to
machines. 120 had always been sufficient in my earlier experience.
> > I've balanced the loads on the three phases about as well as possible.
> > We still have neutral current, about 1/3 to 1/2 the magnitude of any of
> > the per-phase currents.
> yow! isn't that very high? we had an anti-neutral-current squad on
> campus earlier this year, and they freaked out over our old machineroom
> which had neutral that was about 10% of the others...
Yeah, it seems high to me. The PDUs aren't alarming on it though, and the
Liebert folks haven't raised any concerns about it.
I have to learn more about where this current is likely coming from, but I
think I'm hearing that it could be caused by harmonics?
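
For what it's worth, here is a rough sketch of why harmonics could produce neutral current even with perfectly balanced phases. In a balanced wye system the fundamental currents are 120 degrees apart and cancel in the neutral, but third-harmonic (and other triplen) currents from each phase line up in phase and add. The 100 A fundamental and 15 A third-harmonic figures below are made-up illustration values, not measurements from our PDUs, and the function name is just something I picked for the example.

```python
import cmath
import math

# Phase angles of the three legs of a balanced wye system (radians).
PHASE_SHIFTS = (0.0, -2 * math.pi / 3, 2 * math.pi / 3)

def neutral_magnitude(amps_per_phase, harmonic):
    """Magnitude of the neutral current for one harmonic order, assuming
    every phase carries the same amplitude of that harmonic (balanced load).
    A harmonic of order h has its phase angle multiplied by h."""
    total = sum(amps_per_phase * cmath.exp(1j * harmonic * shift)
                for shift in PHASE_SHIFTS)
    return abs(total)

# Hypothetical numbers: 100 A fundamental and 15 A third harmonic per phase.
fundamental_neutral = neutral_magnitude(100.0, 1)  # cancels to ~0 A
third_neutral = neutral_magnitude(15.0, 3)         # adds up to 3 x 15 A
fifth_neutral = neutral_magnitude(15.0, 5)         # non-triplen, cancels to ~0 A

print(f"fundamental in neutral:    {fundamental_neutral:.3f} A")
print(f"3rd harmonic in neutral:   {third_neutral:.3f} A")
print(f"5th harmonic in neutral:   {fifth_neutral:.3f} A")
```

With those assumed values, a third-harmonic content of only 15% per phase would put roughly 45 A on the neutral, which is in the same 1/3 to 1/2 range as what we see. That doesn't prove harmonics are the cause here; it only shows the arithmetic is plausible.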
> > The problem is this: We can fire up our cluster to about 40% of maximum
> > load and everything is fine. But if we go over some threshold right
> > around 40% of max, the output currents from the PDUs go unstable. It's
> "fire up" means power on at the same time? what happens if you sneak up
> the load (say, one node per minute to be conservative.)? I'm wondering
> whether part of your problem is inrush/spinup load.
At this moment, we don't yet have power-up/down automated over the network,
so we actually go press all the buttons. It's not as bad as it sounds --
it takes me probably 20 seconds to hit the buttons on a rack of 40
computers, or 2 nodes per second roughly.
One node per minute would take most of a day to power up the whole cluster,
so that's out. :) In my brief discussions with others, 2/second seemed OK
-- inrush should take less than 1/2 second, right?
These computers & power supplies have standby power. The inrush from
"unplugged" to "plugged in", that is, the standby power inrush, is 3A max.
I've not measured whether there is any inrush associated with powering up
fully, but I'd expect that not to matter much after .5 second.
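
A back-of-the-envelope sketch of the power-up timing, to show the reasoning. The node count of 1000 is a placeholder I'm assuming for illustration (it isn't stated anywhere above), and the 0.5 second inrush duration is my guess from the discussion, not a measurement.

```python
import math

def powerup_seconds(nodes, nodes_per_second):
    """Total time to switch on every node at a steady rate."""
    return nodes / nodes_per_second

def max_overlapping_inrush(nodes_per_second, inrush_seconds):
    """Upper bound on how many nodes can be inside their inrush window
    at the same instant when nodes are started at a steady rate."""
    return max(1, math.ceil(nodes_per_second * inrush_seconds))

NODES = 1000          # assumed cluster size, for illustration only
INRUSH_SECONDS = 0.5  # assumed inrush duration, not measured

slow = powerup_seconds(NODES, 1 / 60)  # one node per minute
fast = powerup_seconds(NODES, 2.0)     # two nodes per second, by hand

print(f"1 node/minute:  {slow / 3600:.1f} hours")
print(f"2 nodes/second: {fast / 60:.1f} minutes")
print("nodes in inrush at once at 2/s:",
      max_overlapping_inrush(2.0, INRUSH_SECONDS))
```

Under those assumptions, one node per minute takes well over half a day, while two per second finishes in under ten minutes with at most one node in its inrush window at any instant.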
As for hard drive spinup, I can't imagine the additional 20W or less per
node would matter that much at 2 nodes per second. Our room has been
powered up to about 300kW on UPS bypass without a hitch.
> > This instability only happens when the UPS is online. If we put the
> > UPS in bypass, we can go up to around 70% of max load with no
> > instability (all computers on but idling in the OS; we haven't tested
> > all nodes at 100% CPU yet).
Yeah. Tell me about it. :/ Until this problem is solved, we have two options:
a) Run fulltime with only 40% capacity.
b) Run overnight with 40% capacity with the UPS online, and 100% during the
workday in UPS bypass. Cross our fingers that our luck for the past year
and a half holds, and we get no significant power events. Oh, and make
backups religiously (which should go without saying anyway).