[Beowulf] Geriatric computer does not stay up

Jack Carrozzo jack at crepinc.com
Wed Dec 16 14:36:05 PST 2009


I assume you've done this but forgot to mention it in the email - did
you test the RAM?

-Jack Carrozzo

On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
> So we have a cluster of Tyan S2466 nodes and one of them has failed in
> an odd way. (Yes, these are very old, and they would be gone if we had a
> replacement.)  On applying power the system boots normally and gets far
> into the boot sequence, sometimes to the login prompt, then it locks up.
>  If booted failsafe it will stay up for tens of minutes before locking.
>  It locked once on "man smartctl" and once on "service network start".
> However, on the next reboot, it didn't lock with another "man smartctl",
> so it isn't like it hit a bad part of the disk and died.  Smartctl test
> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
> healthy with no blocks swapped out.  Power stays on when it locks, and
> the display remains as it was just before the lock.  When it locks it
> will not respond to either the keyboard or the network.  (The network
> interface light still flashes.)  There is nothing in any of the logs to
> indicate the nature of the problem.
>
> The odd thing is that the system is remarkably stable in some ways.  For
> instance, the PS tests good and heat isn't the issue: after running
> sensors in a tight loop to a log file, waiting for it to lock up, then
> looking at the log on the next failsafe boot, there was negligible
> fluctuation in any of the voltages, fan speeds, or temperatures.  It
> will happily sit for 30 minutes in the BIOS, or hours running memtest86
> (without errors).  The motherboard battery is good, and the inside of
> the case is very clean, with no dust visible at all.  I reset the BIOS,
> but that didn't change anything.
>
> Here are my current hypotheses for what's wrong with this beast:
>
> 1. The drive is failing electrically, puts voltage spikes out on some
> operations, and these crash the system.
> 2. The motherboard capacitors are failing and letting too much noise in.
>  The noise which is fatal is only seen on an active system, so sitting
> in the BIOS or in memtest86 does not do it. (But the caps all look good,
> no swelling, no leaks.)  It will run memtest86 overnight though, just in
> case.
> 3. The PS capacitors are failing, so that when loaded there is enough
> voltage fluctuation to crash the system.  (Does not agree very well with
> the sensors measurements, but it could be really high frequency noise
> superimposed on a steady base voltage.)
> 4. Evil Djinn ;-(
>
> Any thoughts on what else this might be?
>
> Thanks.
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
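For anyone wanting to repeat the "sensors in a tight loop to a log file" test described above, here is a minimal sketch in Python. It is not the script David used. The function name log_samples, the log file name, and the one-second interval are all made up for illustration, and it assumes the lm-sensors "sensors" command is installed. Each sample is flushed and fsync'ed so that the last readings before a hard lock are pushed out to the disk rather than left in a buffer (a drive's own write cache could still lose the final sample).

```python
import os
import subprocess
import time
from datetime import datetime


def log_samples(cmd, log_path, count, interval=1.0):
    """Run cmd `count` times, appending timestamped output to log_path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with open(log_path, "a") as log:
        for i in range(count):
            # Capture the command's stdout as text (needs Python 3.7+).
            result = subprocess.run(cmd, capture_output=True, text=True)
            stamp = datetime.now().isoformat()
            log.write(f"=== {stamp} rc={result.returncode}\n")
            log.write(result.stdout)
            # Push the sample to disk before taking the next one, so a
            # lockup does not leave it stranded in a buffer.
            log.flush()
            os.fsync(log.fileno())
            if i + 1 < count:
                time.sleep(interval)


# Example call (assumes `sensors` is on the PATH):
# log_samples(["sensors"], "sensors.log", count=3600, interval=1.0)
```

Sampling at this rate only shows slow drift. It would not reveal the high-frequency noise suggested in hypothesis 3, which would need a scope on the supply rails.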
