ch_p4 Error -> System Hangs
Donald Becker
becker at scyld.com
Tue Nov 6 07:55:35 PST 2001
On Tue, 6 Nov 2001, Chadalavada Kalyana Krishna wrote:
> I am working on a 7-node Linux cluster (6 compute
> nodes, 1 FS).
What system? (Kernel version, etc.)
> system from which the program was started, hung. I
> could not trace out the source to any s/w problem or
> installation, though I am not sure about it.
>
> Repeated attempts to run the same program hung
> n09, n11, n13, n14, and n15. I was not able to ping
> those systems. I also do not understand why n10 did
> not hang, though I ran the program there too.
>
> The display is:
>
> Code: some numbers.
>
> Alicee: Killed Interrupt handler
You have a kernel crash. Given that it didn't occur on all systems, you
should look first for a hardware problem, especially memory corruption.
> One important point is that we have configured mpich
> to use ssh instead of rsh for communication.
This is unlikely to be related to the kernel crash.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993