[scyld-users] Re: Scyld system mysteriously locks up
Tim Whitcomb
twhitcomb at apl.washington.edu
Thu Mar 18 12:20:01 PST 2004
> I purchased a 4 node, 8 processor Scyld (version 28) cluster
> approximately 6 months ago. About 5 days ago, it started mysteriously
> locking up on me. Once it is locked up, I can't do anything except
> physically reboot the machine.
> Unfortunately, I am rather new to Linux clusters and, since it worked
> "right out of the box", I have had no experience in troubleshooting.
> Can someone give me an idea of where I should start?
> I have the BIOS on all machines set to do a full memory check on startup
> and the /var/log/messages file shows nothing.
> Thanks,
> Eric
This sounds suspiciously like a problem we've been fighting for at least
the past year. Are the machines actively running a job when they lock
up, or are they sitting idle? I've done some tests that seem to suggest
that our system has trouble when the same job runs on both processors
of the same machine. Where did you purchase your equipment, what kind
of processors are in it, what kind of interconnect are you using, and
what motherboard is in the machines?
TRW
Timothy R. Whitcomb
===================
Meteorologist
Applied Physics Lab
University of Washington
mail: twhitcomb at apl dot washington dot edu
voice: (206) 543-2663