[Beowulf] Geriatric computer does not stay up
Jack Carrozzo
jack at crepinc.com
Wed Dec 16 14:36:05 PST 2009
I assume you've done this but forgot to mention it in the email - did
you test the RAM?
-Jack Carrozzo
On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
> So we have a cluster of Tyan S2466 nodes and one of them has failed in
> an odd way. (Yes, these are very old, and they would be gone if we had a
> replacement.) On applying power the system boots normally and gets far
> into the boot sequence, sometimes to the login prompt, then it locks up.
> If booted failsafe it will stay up for tens of minutes before locking.
> It locked once on "man smartctl" and once on "service network start".
> However, on the next reboot, it didn't lock with another "man smartctl",
> so it isn't like it hit a bad part of the disk and died. A smartctl self-test
> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
> healthy with no blocks swapped out. Power stays on when it locks, and
> the display remains as it was just before the lock. When it locks it
> will not respond to either the keyboard or the network. (The network
> interface light still flashes.) There is nothing in any of the logs to
> indicate the nature of the problem.
>
> The odd thing is that the system is remarkably stable in some ways. For
> instance, the PS tests good and heat isn't the issue: after running
> sensors in a tight loop to a log file, waiting for it to lock up, then
> looking at the log on the next failsafe boot, there was negligible
> fluctuation on any of the voltages, fan speeds, or temperatures. It
> will happily sit for 30 minutes in the BIOS, or hours running memtest86
> (without errors). The motherboard battery is good, and the inside of
> the case is very clean, with no dust visible at all. Reset the BIOS but
> it didn't change anything.
>
> Here are my current hypotheses for what's wrong with this beast:
>
> 1. The drive is failing electrically, puts voltage spikes out on some
> operations, and these crash the system.
> 2. The motherboard capacitors are failing and letting too much noise in.
> The noise which is fatal is only seen on an active system, so sitting
> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
> no swelling, no leaks.) It will run memtest86 overnight though, just in
> case.
> 3. The PS capacitors are failing, so that when loaded there is enough
> voltage fluctuation to crash the system. (Does not agree very well with
> the sensors measurements, but it could be really high frequency noise
> superimposed on a steady base voltage.)
> 4. Evil Djinn ;-(
>
> Any thoughts on what else this might be?
>
> Thanks.
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
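
The sensor-logging step in the quoted message (running sensors in a tight
loop to a log file, then reading the log on the next failsafe boot) could be
sketched roughly as below. This is only an illustration, not the script that
was actually used: the function name log_command_loop, its parameters, and
the record format are assumptions. Each sample is flushed and fsync'ed so
that the last readings before a hard lockup are less likely to be lost in
the page cache.

```python
import os
import subprocess
import time


def log_command_loop(cmd, log_path, interval=1.0, iterations=None):
    """Run cmd repeatedly, appending timestamped output to log_path.

    Hypothetical helper for illustration only.
    iterations=None means loop until interrupted (or until the box locks).
    Returns the number of samples written.
    """
    count = 0
    with open(log_path, "a") as log:
        while iterations is None or count < iterations:
            # Capture both stdout and stderr of the monitored command.
            result = subprocess.run(cmd, capture_output=True, text=True)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"=== {stamp} rc={result.returncode}\n")
            log.write(result.stdout)
            if result.stderr:
                log.write(result.stderr)
            # Push the record out of Python's buffer and ask the kernel
            # to write it to disk before taking the next sample.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if iterations is None or count < iterations:
                time.sleep(interval)
    return count
```

For the case described above one might call
log_command_loop(["sensors"], "/var/log/sensors-loop.log"), assuming the
lm-sensors "sensors" command is installed and that path is writable. A
polling loop like this would not catch high-frequency noise of the kind
suggested in hypothesis 3, so a flat log does not rule that out.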