[Beowulf] Re: UPS & power supply instability

David Kewley kewley at gps.caltech.edu
Thu Sep 29 09:20:07 PDT 2005

On Wednesday 28 September 2005 23:04, Maurice Hilarius wrote:
> David Kewley wrote:
> > ..
> >
> >One quick update: We finally had a high-level engineering conference
> >call with Liebert this morning, at their instigation.  It was a very
> >good call, and they're being very helpful.  I hope there'll be a
> >workaround soon, but we may have to live with this problem for a while
> >yet...
> >
> >David
> That sounds like positive progress..
> Did they name a reason this is happening, or are they taking steps to
> send someone down with a scope to see what is happening?

Yes, they're sending people out.  Liebert engineers say they have a basic 
understanding of what's going wrong, and have some possible ways to work 
around it or solve it.  I'll not say more at this time, to give them time 
to work it out.

> >P.S. I suppose you can guarantee that Hard Data power supplies would not
> >induce current oscillations in the room? :)  Didn't think so.
> Actually, if the room is wired correctly we can.
> We have been building servers and installations in server rooms since
> 1992, about 2 times as long as Dell have.
> And, we DO claim PFC compliance on our power supplies, and can produce
> test results and compliancy test reports to back that up.
> However, instead of trying to deflect from the real issue, let's look at
> what I replied to, word for word:

I apologize for making a snide comment.

> >Mind you, the "blame" may be shared by the
> >Liebert UPS and the Dell power supplies, but I'm relying on Liebert to
> >figure out why things go unstable *when their UPS is online, supplying
> >a load that should be quite normal*, and so far they have no solution
> >for me.  We can't just wait on Liebert; this problem is hamstringing
> >our use of our new 1024-node cluster.  So now I turn to this list.
> You have obviously decided, in advance, that the problem is with the
> Liebert equipment.

Maurice, in your two replies to this thread you've made lots of incorrect 
inferences and assumptions, including this one.  Please show me the respect 
of *not* assuming what I think, what I've done, or what others have done at 
our site for this problem.  If I've not *stated* some fact that you think 
is important, simply ask me rather than assuming.

> You mention absolutely nothing about testing the power supplies.
> That step should be the first, and fortunately is the easiest.
> Almost any modern scope will do the job.
> As it is low frequency it does not have to be an expensive or
> specialized scope.
> Instead of trusting "Kill-a-Watt clones" why not check the actual power
> supply response, on a standard 115V single phase power input circuit?

That's an excellent suggestion, and is in accord with my usual 
troubleshooting & experimental inclinations.  But because I have the 
responsibility for *all* the aspects of commissioning this brand-new, large 
cluster, I've had to leave lots of details to others, Liebert in this case.

To the best of my knowledge, Liebert has not studied these exact power 
supplies, but they say they understand PSes that are similar enough that 
they can work out a model of our specific problem.  Until I have time to 
run experiments myself, I am going to trust them to cover these bases.

> I have seen power regulation equipment fail in a similar fashion before,
> where the power supplies are pulling down too much current to the
> neutral phase,
> and making the power feed overload on one phase, driving it into
> instability.
> This is a classic symptom of cheap, poorly designed and made power
> supplies. Or bad room wiring, with undersized neutral lines.

The PDUs have a front panel that displays lots of diagnostic measurements, 
and they sound a rather piercing alarm when any measurement goes over its 
Liebert-defined limit (they are the only alarms I've heard in that room 
that can reliably be heard over the room noise, from any part of the 
room :).  The PDUs also have suitably sized breakers and suitably sized 
conductors on each of the 93 branch circuits.

The three output phase currents all stay well under their limits when they 
first become unstable, at the low-power end of the instability domain, and 
remain under them well into that domain.  Toward the high-power end of the 
instability domain that we've tested, the current oscillations become large 
enough, and sit on top of a large enough average current, that the PDUs *do* 
give overcurrent alarms (plus other alarms due to the wild oscillations).

Unless something is going on that the PDUs don't alarm on, neither the PDUs 
nor the Liebert techs who've been onsite indicate any problem with the 
neutral wiring or the power supplies per se.

> Liebert make big UPS and power units, and those are their "bread &
> butter"
> Frankly I am surprised they have not yet dispatched a tech down to your
> site with test equipment by now..

When did I say they haven't dispatched a tech to our site?  In fact they 
have, multiple times; I just hadn't mentioned that up to this point in this 
thread.  My concern was not that they aren't sending techs, but that they 
have no solution yet, and that I wasn't getting a warm-fuzzy feeling that 
they really were treating this problem as critically as we need them to.

After yesterday's conference call, I feel better about their efforts.  Even 
so, the proof is still in the outcome, and the outcome is far from certain.

> When you say "Liebert has been on this case for something like 4 weeks
> now." what does that mean?

That's when we first demonstrated this problem to their onsite tech & 
engaged their help in solving it.

> >Can anyone here offer ideas, or better yet, experience?
> I was trying to.
> Apparently you do not appreciate suggestions, except ones that support
> your distrust of Liebert.

I appreciate all constructive suggestions.  My appreciation does not extend 
to insinuations.

Thanks for trying to offer ideas & experience.  I *do* appreciate some of 
what you've written in this email.  I appreciate *none* of what you wrote 
in your first reply to this thread -- if you like, go back and read it and 
see if you can understand why.

> Why not test the power supplies?
> If doing it yourself is not something you are comfortable with, there
> are many electrical inspection labs in your region that provide this
> service, usually for under $150.
> Look in the yellow pages under "testing" or similar.
> Many will allow you to stand there and watch and ask questions as they
> do it.

Now *that* is a very good suggestion.  Thank you.  I did not know testing 
could be this easy.  (By the way, I'm comfortable with testing / measuring 
the power supplies, although I don't have the equipment on hand to do it 
properly, and I don't have the full range of knowledge to interpret all of 
what I measure.)

For now, I'm going to continue to let Liebert run with this problem; we've 
offered to get them a power supply to take apart and/or measure, but so far 
they seem to believe they understand it well enough.  I'm also going to 
trust Dell that their power supplies are of good quality and merely interact 
poorly with the rest of our power infrastructure.

Meanwhile, I have several other things to take care of on the cluster, 
before users can get more than minimal use out of it, so I'm not yet going 
to get into detailed measurements myself.

