custom hardware (was: Xbox clusters?)
Bob Drzyzgula
bob at drzyzgula.org
Thu Nov 29 07:02:00 PST 2001
On Thu, Nov 29, 2001 at 09:15:15AM +0100, Daniel Pfenniger wrote:
>
> David Vos wrote:
> >
> ....
> > There is one computer in our cluster that would make me think twice before
> > doing a custom build. I prefer to call it the node from heck. It only
> > has one problem: it won't boot. If you press the power button, the
> > powerlight flashes while the cpu and case fans turn a quarter turn, then
> > nothing. You have to wait a minute before you even get that reaction
> > again. (Sounds like a short somewhere). The problem only surfaces if the
> > computer has been off for a little while, and nearly every time at that.
>
> I have seen similar strange behavior of some boxes in a set of 66's, and the
> way to restart is also rather odd.
> Basically, and this has been repeatedly observed on several boxes of the same
> composition (dual Pentium III with ASUS P2BD motherboard) aligned on a metallic
> shelf, the ATX box would stop after months of activity, and the simplest found
> way to restart it is to unplug everything (power and ethernet), touch it for
> a few seconds with hands, replug and voila. No need to open the box!
> My guess is that some condensator needs to be unloaded, but exactly why
> one needs to unplug every cable appears curious.
One thing to understand is that, unless there is a physical
switch on the power supply itself, ATX systems are never
*really* turned off as long as they are plugged in -- they
only go to a "standby" state, wherein +5V power is still
being applied to a single pin (the purple wire). When you
press the power button on the front of the chassis, it
merely shorts a header that ultimately causes the
motherboard to short the green wire in the ATX cable to
ground -- this is a signal to the power supply to leave
standby and start generating power for all the other
outputs.
Another thing to observe is that generally, ATX power
supplies are switching supplies, which means that (to
simplify things somewhat) they generate the correct voltage
by charging and discharging a capacitor at a high rate. The
switching controller constantly monitors the voltage on the
capacitor and connects or disconnects the capacitor to the
incoming supply, depending on whether the charge is above or
below the desired level (the detailed truth behind this is
fairly complex and typically involves multiple stages and
inductors as well as capacitors, but this model is probably
good enough for this discussion...). Thus, even when an ATX
system is "off", the power supply is chugging along, keeping
a capacitor charged to provide +5V at a low current. BTW, if
you have the resources to do this, put a current sensor on
the incoming AC line for a running system and feed the
output to an oscilloscope. You should see a series of
alternating positive and negative spikes -- those are the
capacitors charging at the peaks and troughs of the AC
voltage.
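
The simplified "connect or disconnect the capacitor" model above can be sketched as a toy simulation. This is only an illustration of that mental model, not of a real ATX supply: the function name and every component value (input voltage, resistances, capacitance, hysteresis band) are assumptions chosen to keep the arithmetic stable.

```python
def simulate_bang_bang(v_target=5.0, band=0.05, v_in=12.0, r_charge=1.0,
                       r_load=50.0, c=1e-3, dt=1e-5, steps=20000):
    """Toy model: one capacitor, one switch, one resistive load.

    All values are illustrative assumptions, not figures from any real supply.
    Returns the capacitor voltage at every time step.
    """
    v = 0.0
    switch_on = True
    trace = []
    for _ in range(steps):
        # Controller: connect the capacitor when the voltage is below the
        # band, disconnect it when above, otherwise keep the last state.
        if v < v_target - band:
            switch_on = True
        elif v > v_target + band:
            switch_on = False
        i_in = (v_in - v) / r_charge if switch_on else 0.0  # charging current
        i_load = v / r_load                                  # load current
        v += (i_in - i_load) * dt / c                        # dV = I * dt / C
        trace.append(v)
    return trace


if __name__ == "__main__":
    tail = simulate_bang_bang()[-5000:]
    print(f"settled ripple: {min(tail):.3f} V to {max(tail):.3f} V")
```

Once the capacitor has charged, the voltage hovers around the target with a small ripple, which is the behavior the simplified description relies on.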
Now, if the ATX board were simply to run the green-wire
contact straight through to the power on/off header, you
wouldn't need much oomph at all on the +5V standby line, and
older ATX power supplies in fact didn't. However, newer
boards have things like Wake-on-LAN, Wake-on-Modem, and
other various and sundry goodies that have to run off the
+5V standby. It has gotten to the point that, in order to
do all the processing that is required to leave standby, the
standby current draw is greater than what some older
supplies can provide. So in the case of a power supply that
either by design or fault cannot provide sufficient current
under standby, what (I think) happens is that while the
motherboard is waiting for the main supply voltages to come
up to full power, the standby processing bleeds off the
capacitor to the point that the standby voltage sags below
the minimum required for operation. At that point, the
standby processing halts, the motherboard stops holding the
green wire to ground, and the power supply stops trying to
power up. It then returns to standby mode, re-charges the
standby capacitor, and the cycle begins again.
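
The failed power-up described above can be put into rough numbers with a small sketch. It assumes a deliberately crude model: the standby rail is one capacitor, the supply can replace at most a fixed current, and the motherboard gives up as soon as the rail drops below a minimum voltage. The function name and all values (currents, capacitance, thresholds, timing) are hypothetical and picked only for illustration.

```python
def try_power_on(i_supply_max, i_wake=1.0, c=2e-3, v_nominal=5.0, v_min=4.0,
                 t_main_rails=0.1, dt=1e-4):
    """Crude model of one power-on attempt.

    i_supply_max : most current (A) the standby output can deliver (assumed)
    i_wake       : current (A) the board draws while leaving standby (assumed)
    c            : capacitance (F) holding up the standby rail (assumed)
    t_main_rails : time (s) until the main outputs are up (assumed)
    Returns (succeeded, elapsed_seconds, final_standby_voltage).
    """
    v = v_nominal
    t = 0.0
    # Whatever the supply cannot deliver is drained from the capacitor.
    shortfall = max(0.0, i_wake - i_supply_max)
    while t < t_main_rails:
        v -= shortfall * dt / c
        if v < v_min:
            # Standby logic browns out and releases the green wire,
            # so the supply falls back to standby.
            return False, t, v
        t += dt
    return True, t, v


if __name__ == "__main__":
    for limit in (0.1, 1.5):
        ok, t, v = try_power_on(limit)
        print(f"standby limit {limit} A: ok={ok}, t={t * 1000:.1f} ms, v={v:.2f} V")
```

With these made-up numbers, a supply limited to 0.1 A browns out within a few milliseconds, while one that can deliver 1.5 A holds the rail at 5 V until the main outputs come up.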
If you have a system that is behaving like this, try putting
a voltmeter on the standby pin of the ATX header (you can
usually jab a probe down into the back of the connector).
You should see it at +5V when the system is "off". Then
press the system's "on" button and watch the voltage. You'll
most likely see it sag down to a couple of volts or so. If
this doesn't happen, you've probably got some other problem,
perhaps a POST failure of some sort. Also, this may not be
the end of the diagnosis -- it is possible that the failure
to provide enough current on standby may not be the fault of
the power supply itself. It could be a faulty component
(e.g. the SCSI drive we heard about) sucking down too much
current on power-up, or an overburdened AC supply circuit
that sags just a bit when your system starts up -- in the
latter case I imagine that you could wind up with a
seemingly jinxed spot in the equipment rack. :-)
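
The voltmeter check above amounts to a small decision procedure, sketched here. The function name is hypothetical, and the 4.75 V threshold is an assumption (5 V minus 5%), not a figure from this post; adjust it to whatever tolerance applies to your supply.

```python
def interpret_standby_readings(v_off, v_lowest_after_press, v_ok=4.75):
    """Turn two voltmeter readings on the standby pin into a rough hint.

    v_off                : reading with the system "off" but plugged in
    v_lowest_after_press : lowest reading seen after pressing the power button
    v_ok                 : assumed lowest acceptable standby voltage
    """
    if v_off < v_ok:
        return "low while off: standby rail is not reaching +5V at all"
    if v_lowest_after_press < v_ok:
        return "sags on power-up: standby current is insufficient or overdrawn"
    return "holds steady: probably some other problem, perhaps a POST failure"


if __name__ == "__main__":
    print(interpret_standby_readings(5.02, 2.1))
    print(interpret_standby_readings(5.02, 4.98))
```

A "sags on power-up" result does not say which side is at fault; as noted above, the supply, a greedy component, or the AC circuit could each be responsible.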
BTW, if the power supply has too little oomph on standby by
*design*, the system will probably *never* power up. If the
supply's design meets the new spec only marginally, or if it
is malfunctioning, say, because of a damaged or weakened
capacitor, then it might behave differently when cold than
it does when it is fully warmed up. In this event,
unplugging the supply for a while and reconnecting it can
create a short window in which the supply can get the system
over the hump to leave standby. I in fact have a supply at
home that has this problem, and I just sort of live with it
because it's not my main system. Someday perhaps I'll
replace the supply.
As to why you have to disconnect the Ethernet as well, I
really don't have a clue.
HTH,
--Bob Drzyzgula