[Beowulf] Geriatric computer does not stay up
Eric Thibodeau
kyron at neuralbs.com
Mon Dec 21 11:05:45 PST 2009
This smells like the hell I went through when one of the CPUs needed to be changed in our department's Tyan VX50... Try swapping CPUs if you have spares.
ET
On 2009-12-16, at 5:36 PM, Jack Carrozzo wrote:
> I assume you've done this but forgot to mention it in the email - did
> you test the RAM?
>
> -Jack Carrozzo
>
> On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
>> So we have a cluster of Tyan S2466 nodes and one of them has failed in
>> an odd way. (Yes, these are very old, and they would be gone if we had a
>> replacement.) On applying power the system boots normally and gets far
>> into the boot sequence, sometimes to the login prompt, then it locks up.
>> If booted failsafe it will stay up for tens of minutes before locking.
>> It locked once on "man smartctl" and once on "service network start".
>> However, on the next reboot, it didn't lock with another "man smartctl",
>> so it isn't like it hit a bad part of the disk and died. Smartctl test
>> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
>> healthy with no blocks swapped out. Power stays on when it locks, and
>> the display remains as it was just before the lock. When it locks it
>> will not respond to either the keyboard or the network. (The network
>> interface light still flashes.) There is nothing in any of the logs to
>> indicate the nature of the problem.
>>
>> The odd thing is that the system is remarkably stable in some ways. For
>> instance, the PS tests good and heat isn't the issue: after running
>> sensors in a tight loop to a log file, waiting for it to lock up, then
>> looking at the log on the next failsafe boot, there was negligible
>> fluctuation in any of the voltages, fan speeds, or temperatures. It
>> will happily sit for 30 minutes in the BIOS, or hours running memtest86
>> (without errors). The motherboard battery is good, and the inside of
>> the case is very clean, with no dust visible at all. Reset the BIOS but
>> it didn't change anything.
>>
>> Here are my current hypotheses for what's wrong with this beast:
>>
>> 1. The drive is failing electrically, puts voltage spikes out on some
>> operations, and these crash the system.
>> 2. The motherboard capacitors are failing and letting too much noise in.
>> The noise which is fatal is only seen on an active system, so sitting
>> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
>> no swelling, no leaks.) It will run memtest86 overnight though, just in
>> case.
>> 3. The PS capacitors are failing, so that when loaded there is enough
>> voltage fluctuation to crash the system. (Does not agree very well with
>> the sensors measurements, but it could be really high frequency noise
>> superimposed on a steady base voltage.)
>> 4. Evil Djinn ;-(
>>
>> Any thoughts on what else this might be?
>>
>> Thanks.
>>
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>>
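For anyone wanting to repeat the sensor-logging test described in the quoted message (run sensors in a loop to a log file, wait for the lockup, read the log on the next boot), here is a minimal sketch. It is not the script David used. It assumes the lm-sensors `sensors` command is on the PATH, and the function names, log path, and one-second interval are made up for illustration. The one design point worth copying is forcing every sample to disk, so the last readings before a hard lock are not lost in a buffer.

```python
import os
import subprocess
import time


def read_sensors():
    """Return the text output of the lm-sensors `sensors` command (assumed installed)."""
    return subprocess.run(
        ["sensors"], capture_output=True, text=True, check=False
    ).stdout


def log_samples(path, sample=read_sensors, count=None, interval=1.0):
    """Append timestamped samples to `path`, forcing each one to disk.

    With count=None this loops until the machine locks up or the
    process is killed. Returns the number of samples written.
    """
    written = 0
    with open(path, "a") as log:
        while count is None or written < count:
            log.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
            log.write(sample().rstrip("\n") + "\n")
            log.flush()             # push Python's buffer to the kernel
            os.fsync(log.fileno())  # ask the kernel to write it to the disk
            written += 1
            if count is None or written < count:
                time.sleep(interval)
    return written
```

Calling `log_samples("/var/tmp/sensors.log")` (a hypothetical path) would log until the lockup. A polling loop like this can only show slow drift; it would not catch the high-frequency noise suggested in hypothesis 3, and a drive with write caching enabled may still lose the final sample.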