ch_p4 Error -> System Hangs

Donald Becker becker at scyld.com
Tue Nov 6 07:55:35 PST 2001


On Tue, 6 Nov 2001, Chadalavada Kalyana Krishna wrote:

> I am working on a 7-node Linux cluster (6 compute
> nodes, 1 FS).

What system?  (Kernel version, etc.)
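
A minimal sketch of how those details could be collected on each node, assuming Python is available there. The function name `system_report` is only illustrative; the calls themselves come from the standard `platform` module.

```python
import platform


def system_report():
    """Return basic identifying details for the machine this runs on."""
    u = platform.uname()
    return {
        "node": u.node,               # host name, e.g. the node label
        "system": u.system,           # OS name, e.g. "Linux"
        "kernel_release": u.release,  # kernel release string
        "kernel_version": u.version,  # kernel build/version string
        "machine": u.machine,         # hardware architecture
    }


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Running this on every node and comparing the output is one way to spot a node whose kernel differs from the rest.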

> system from which the program was started hung. I
> could not trace the cause to any software or
> installation problem, though I am not sure about it.
> 
> Repeated attempts to run the same program hung
> n09, n11, n13, n14, and n15. I was not able to ping
> those systems. I also do not understand why n10 did
> not hang, though I ran the program there too.
> 
> The display is:
> 
> Code: some numbers.
> 
> Alicee: Killed Interrupt handler

You have a kernel crash.  Given that it didn't occur on all systems, you
should look first for a hardware problem, especially memory corruption.
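
As a rough illustration of what a memory check does, here is a user-space sketch that fills a buffer with fixed bit patterns and reads them back. The function name `pattern_test`, the buffer size, and the pattern set are all assumptions made for this example. It only touches whatever pages the process is given, so it cannot replace a dedicated tester such as memtest86 booted standalone.

```python
def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern into a buffer and verify it on read-back.

    Returns a list of (pattern, offset) tuples for every mismatching byte;
    an empty list means no corruption was observed in this buffer.
    """
    buf = bytearray(size_bytes)
    errors = []
    for p in patterns:
        expected = bytes([p]) * size_bytes
        buf[:] = expected                 # fill the whole buffer with the pattern
        if buf != expected:               # fast whole-buffer comparison first
            for offset, value in enumerate(buf):
                if value != p:
                    errors.append((p, offset))
    return errors


if __name__ == "__main__":
    bad = pattern_test()
    print("no mismatches seen" if not bad else f"{len(bad)} mismatching bytes")
```

A clean result here proves little, since intermittent faults often show up only under load or at specific physical addresses. Any mismatch at all, however, would point strongly at hardware.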

> One important point is that we have configured mpich
> to use ssh instead of rsh for communication.

Using ssh instead of rsh is unlikely to be related to a kernel crash.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



