Myrinet hardware reliability
Robert G. Brown
rgb at phy.duke.edu
Fri Feb 7 09:56:54 PST 2003
On Fri, 7 Feb 2003, Victoria Pennington wrote:
> We have a 113 node IBM x330 cluster with Myrinet 2000. We're
> experiencing very high failure rates on Myrinet switch ports
> (average 3 per month) and on Myrinet NICs to a lesser extent
> (about 1 per month). Ports and NICs are fine one minute,
> then one or the other just dies (for good). Cables
> (fibre, not copper) seem fine - one or two failures only in
> nearly a year.
> There is no pattern in the failures, and they are entirely
> unrelated to usage levels; seldom used nodes are just as
> likely to have failures as heavily used nodes.
I'd at least suspect wiring and spikes. There was a long discussion on
ways to MISwire computer rooms on the list a few months ago that you
might look up in the list archives; within it are some URLs to sites
that describe some of the problems that can occur when one plugs lots of
nodes into multiphase circuits that, e.g., share a common neutral.
In a nutshell, you can be building up a significant neutral line voltage
and browning out your supply voltage during the critical middle third of
each half-cycle, when switching power supplies tend to draw all their
power. This harmonic distortion of the supply voltage can cause all
sorts of problems, premature and inexplicable component failure being
one of them. Since systems experiencing it tend to run with their power
supply capacitors inadequately charged, it can significantly reduce the
ability of those power supplies to filter out spikes. Even surge
protectors don't eliminate all of the problems that can be caused.
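To make the shared-neutral problem concrete, here is a minimal sketch (not a model of any real room) that sums the currents of three identical loads, one per phase, into a common neutral. The amplitudes (10 A fundamental, 8 A third harmonic) and the function name neutral_rms are assumptions chosen purely for illustration. The point it shows: for a sinusoidal load the three phase currents cancel in the neutral, but the third-harmonic current drawn by non-power-factor-corrected switching supplies is in phase on all three legs and adds instead.

```python
import math

def neutral_rms(harmonics, samples=3600):
    """RMS current in a neutral shared by three identical loads 120 degrees apart.

    harmonics maps harmonic order -> peak amplitude in amps of each phase
    current (illustrative values, not measurements).
    """
    total_sq = 0.0
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        i_n = 0.0
        # The neutral carries the sum of the three phase currents.
        for shift in (0.0, 2 * math.pi / 3, 4 * math.pi / 3):
            for order, amp in harmonics.items():
                i_n += amp * math.sin(order * (theta - shift))
        total_sq += i_n * i_n
    return math.sqrt(total_sq / samples)

# Purely sinusoidal load: the phase currents cancel in the neutral.
linear = neutral_rms({1: 10.0})
# Same fundamental plus an assumed 8 A third harmonic on each phase.
switching = neutral_rms({1: 10.0, 3: 8.0})

print(f"sinusoidal load, neutral RMS:       {linear:.3f} A")
print(f"with third harmonic, neutral RMS:   {switching:.3f} A")
```

With these assumed numbers each phase carries about 9 A RMS, while the shared neutral carries roughly 17 A RMS, all of it third harmonic. That is why a neutral sized like the phase conductors can overheat.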
You might check up on just how the room was wired. In particular, look
at the voltage between neutral and ground at the receptacles where all
the nodes are plugged in. If it is as high as a few volts, you may have
a problem, especially if it is high on the circuit where the systems are
located and not high on the circuit where the switch is located. If you
have an oscilloscope, you can look at the actual supply line voltage at
the receptacles loaded and unloaded, to see how badly distorted the waveform is.
Solutions (if this turns out to be your problem):
a) NEVER share a neutral wire between three phases in a computer room.
The load isn't resistive and doesn't have a power factor near one, and
it is actually dangerous to do so (it overheats the neutral and the main
supply transformer). Run a separate neutral for each phase.
b) Try to keep the runs as short as possible and use heavy gauge wire.
The neutral line voltage depends on the current it carries and its
resistance. Resistance increases with the length of the run, decreases
with the cross-sectional area of the wire.
c) Use power factor corrected power supplies if possible, or a
harmonic correction supply transformer for the entire space (there are
companies that would love to sell you one).
d) Some people on the list suggested that a UPS would probably help.
It seems like an expensive solution compared to running additional
neutrals, and nearly as expensive as getting a harmonic correction transformer.
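Point b) above can be put into rough numbers. The following is a back-of-the-envelope sketch, assuming copper wire near room temperature and ignoring connections, temperature rise and skin effect; the 17 A neutral current, the run lengths and the helper names are hypothetical values picked only to show the scaling. It computes the wire resistance as resistivity times length divided by cross-sectional area, and then the neutral voltage as current times resistance.

```python
RHO_COPPER = 1.68e-8  # ohm*metre, copper near 20 C

def wire_resistance(length_m, area_mm2, rho=RHO_COPPER):
    """Resistance of a single conductor: R = rho * L / A."""
    return rho * length_m / (area_mm2 * 1e-6)

def neutral_voltage(current_a, length_m, area_mm2):
    """Voltage developed along the neutral: V = I * R."""
    return current_a * wire_resistance(length_m, area_mm2)

# Nominal cross-sections for two common wire gauges.
AWG12_MM2 = 3.31
AWG8_MM2 = 8.37

# Assumed 17 A of neutral current in both cases.
long_thin = neutral_voltage(17.0, 30.0, AWG12_MM2)    # 30 m run of AWG 12
short_thick = neutral_voltage(17.0, 10.0, AWG8_MM2)   # 10 m run of AWG 8

print(f"30 m of AWG 12: {long_thin:.2f} V on the neutral")
print(f"10 m of AWG 8:  {short_thick:.2f} V on the neutral")
```

Under these assumptions the long, thin run develops about 2.6 V on the neutral, which is in the "few volts" range worth worrying about, while the short, heavy run stays near 0.3 V.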
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu