Hangs

Jean-Christophe Ducom jducom at nd.edu
Wed Jul 31 13:25:18 PDT 2002


The nodes of our cluster are:
Dell Workstation, dual Xeon 1.7GHz, 1GB RAM, RedHat 7.2 running kernel
2.4.18 patched for IRQ balancing, Syskonnect SK9D21 Gigabit Ethernet

The cluster is heavily used for MPI programs using MPICH 1.2.4.
Each node mounts NFS directories with the following options:
rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192
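
For reference, here is a small sketch of how that option string breaks
down. The point of interest is "hard": with a hard mount, a process
touching NFS blocks until the server answers ("intr" only lets signals
interrupt it), so a stalled server could in principle look like a hung
node. The helper name parse_mount_options is made up for illustration.

```python
def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Flags such as "hard" map to True; key=value pairs such as
    "rsize=8192" map to an int when the value is numeric.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            parsed[key] = True          # bare flag, e.g. "hard"
        elif value.isdigit():
            parsed[key] = int(value)    # numeric value, e.g. rsize=8192
        else:
            parsed[key] = value         # anything else stays a string
    return parsed


if __name__ == "__main__":
    opts = parse_mount_options(
        "rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192")
    if opts.get("hard") and not opts.get("soft"):
        print("hard mount: I/O blocks until the NFS server responds")
    print("rsize/wsize:", opts.get("rsize"), opts.get("wsize"))
```

This only inspects the string; it says nothing about whether NFS is
actually the cause of the hangs.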

ACPI is installed to work around some APM issues with the poweroff
command on SMP machines.

However, some nodes occasionally hang for unknown reasons. They don't
seem to crash, though (I expect they would reboot in that case anyway;
cat /proc/sys/kernel/panic gives 0).
There is no way to connect to them.
I installed a serial console on some nodes (cf. my previous email about
remote serial console). When I connect through the serial console to a
hung node, I can't even reboot the node, BUT minicom shows that the
machine is ONLINE.
It happens most of the time when MPI programs establish communications
between nodes.
What's going on? Is NFS hanging (though there is nothing in
/var/log/messages or the other logs)? An ACPI problem? Does the console
die? Switch issues?
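
To narrow down "no way to connect", a probe along the following lines
might help separate a node that is off the network from one whose
kernel still answers but whose daemons are stuck. This is only a
sketch: it assumes a service that sends a greeting banner on connect
(sshd on port 22 here, which is an assumption about the setup), and
probe_node is a made-up name. The TCP handshake is completed by the
kernel, while the banner has to come from the user-space daemon.

```python
import socket


def probe_node(host, port=22, timeout=3.0):
    """Classify a node by how far a TCP conversation gets.

    "unreachable": the TCP connection could not be established.
    "silent":      the connection opened but no banner arrived in time,
                   which suggests the kernel answers while the daemon
                   behind the port is not running normally.
    "responding":  the daemon sent some data.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        try:
            banner = sock.recv(64)
        except OSError:
            # Covers a receive timeout as well as a connection reset.
            return "silent"
    return "responding" if banner else "silent"


if __name__ == "__main__":
    # Hypothetical node names; replace with the real ones.
    for node in ["node01", "node02"]:
        print(node, probe_node(node))
```

A "silent" result on a hung node would point away from the switch and
toward something blocking on the node itself, but that is a guess, not
a diagnosis.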

Any ideas?

Thanks

                JC




