custom hardware (was: Xbox clusters?)

David Vos dvos12 at calvin.edu
Wed Nov 28 14:11:09 PST 2001


On Wed, 28 Nov 2001, John Burton wrote:
> Ummmm....speak for yourself. I've been putting together these "self
> assembled beige box" for many years and currently have about 5%
> component DOA rate, and about another 1% infant mortality rate (crap
> out within 30 days).  Takes on average 4 hours to determine what the
> bad component is and 24-48 hours to replace it. I've never spent more
> than 1 week "figuring out" which part is broken. The time I spent 1
> week was due to a flakey memory chip that was causing filesystem
> errors in a 90GB RAID 5 array.  Flakey memory is difficult to track
> down because it can masquerade as virtually anything else...

There is one computer in our cluster that would make me think twice before
doing a custom build.  I prefer to call it the node from heck.  It only
has one problem: it won't boot.  If you press the power button, the
power light flashes while the CPU and case fans turn a quarter turn, then
nothing.  You have to wait a minute before you even get that reaction
again.  (Sounds like a short somewhere.)  The problem only surfaces if the
computer has been off for a little while, but then it shows up nearly
every time.

1st Occurrence (several months ago).  Try new power supply.  No go.
Remove drives, cards, etc. from motherboard until only the (new) PS (power
supply), motherboard, mem, and CPU are left.  Nope.  Swap mem.  Nope.  Swap
CPU.  Nope.  Sounds like the motherboard (I replaced everything else).  I
put the original parts back in (and drop a screwdriver on the motherboard
by accident), and it suddenly starts working.  I put the computer back in
and it runs fine with everything the way it was before.

2nd Occurrence (a month or so later).  I knew it was a bad motherboard last
time, so I replaced the motherboard.  Worked great.

3rd Occurrence (a month or so later).  I take things apart and put them
back together.  Starts working.  Now I'm starting to get confused.

4th Occurrence (a month or so later).  I remove drives and cards, put in
spare PS.  Nothing.  Remove motherboard and put on a piece of wood with
nothing attached but spare PS, CPU, and mem (using a screwdriver to short
pins instead of power switch).  Used a new power cable plugged into a
different circuit.  Nothing.  Try new mem.  Get another system and
individually check mem, motherboard, CPU.  They are all good.  Try both
PS's in other system and problem follows them.  Two bad power supplies --
not too unusual.  I replace them, and things run great.

5th Occurrence (recently).  I removed all cards and drives from the
motherboard.  Nothing.  Tried spare PS.  That worked.  Unplugged the
current PS from the case, HD, and FD, and it started working.  Put
everything back together and it was still working.

Since there is not a single piece of hardware that was present in each
case, I feel forced to conclude that there must be something (power cord?)  
that is breaking the power supplies.  I have not seen this problem on any
other computers.  This is the point at which I would love to put the whole
computer back in a box and send it to the reseller.

Luckily we never sent back the "bad" motherboard and kept it around as a
spare, since it now works fine on other systems.

David



