[Beowulf] Tyan 2466 crashes, no obvious reason why

David Mathog mathog at mendel.bio.caltech.edu
Tue Oct 12 11:38:10 PDT 2004

Just thought I'd share the final outcome of this.

After much swapping around of components and days of
running memtest86, the problem turned out to follow the power
supply.  Swapping in the spare PS fixed it, and that node
has not so much as hiccupped in the month since.

Note in particular that all of the voltages seen
by the motherboard were always in range.  My working hypothesis
is that the PS either passes too much noise or just
glitches occasionally (for instance, an intermittent
internal short).

The PS was a Zippy power supply with a power cord that
attached via spades to the socket at the back of the 2U case.

  model   AX2-5300FB-2S
  P/N     6AX2-300B055
  ser no: T21905564M1A977732

Big EMACS logo, tiny www.zippy.com.tw down at the bottom.
It was still under Zippy's warranty and the good folks
at PSSC handled the exchange promptly.

A day (!) after the replacement unit came in, a
second node started doing the exact same thing -
unexplained crashes and lockups with nothing in the
log file.  Logging lm_sensors every 2 minutes showed nothing
untoward up through the last entry.  Crashes were
every few hours.  This time I just swapped the PS first
thing and it has been ok now for over 4 days.  Same
type of power supply inside, this one with 
Serial No. T21905562M1A977732, which differs by only one digit from
the first one that failed. 
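
For anyone who wants to set up the same kind of periodic sensor logging,
here is a rough sketch in Python.  It is not the script used on these
nodes.  It assumes the lm_sensors `sensors` command is on the PATH, and
the names append_reading, read_sensors and log_forever, like the log
path in the note below, are made up for illustration.  Each entry is
flushed and fsync'd so that the last reading before a hard lockup has a
good chance of reaching the disk.

```python
import os
import subprocess
import time
from datetime import datetime


def append_reading(log_path, reading):
    """Append one timestamped reading to log_path and force it to disk."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} ===\n{reading.rstrip()}\n")
        log.flush()
        # fsync so the entry is not lost in the page cache if the node locks up
        os.fsync(log.fileno())


def read_sensors():
    """Return the text output of the lm_sensors `sensors` command."""
    try:
        result = subprocess.run(
            ["sensors"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # Assumption: lm_sensors may not be installed on every node
        return "sensors command not found"
    return result.stdout or result.stderr


def log_forever(log_path, interval_s=120):
    """Log a reading every interval_s seconds (120 s = every 2 minutes)."""
    while True:
        append_reading(log_path, read_sensors())
        time.sleep(interval_s)
```

Calling log_forever("/var/tmp/sensors.log") from a background job would
then give one entry every 2 minutes.  As noted above, readings that stay
in range right up to the last entry do not rule out the power supply.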

Could be a coincidence but I'm beginning to suspect that there
may be a bad component in this lot of power supplies, in which
case an unpleasant series of node failures can probably be expected
in the not too distant future.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
