Hangs
Jean-Christophe Ducom jducom at nd.edu
Wed Jul 31 13:25:18 PDT 2002
The nodes of our cluster are:
Dell Workstation Dual Xeon 1.7GHz 1GB RAM, RedHat 7.2 running 2.4.18
patched for IRQ balancing, Syskonnect SK9D21 GigEthernet
The cluster is heavily used for MPI programs using MPICH 1.2.4
Each node mounts NFS directories w/ the following options:
rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192
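For readers following along, here is a minimal Python sketch that splits the option string above into flags and key/value settings. The helper name parse_mount_options is hypothetical and not part of the original post. The setting most relevant to a hang investigation is `hard`, which makes an NFS client keep retrying indefinitely when the server stops answering, with `intr` allowing signals to interrupt the wait.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Bare flags (e.g. "hard") map to True; key=value pairs keep
    the value as a string. Hypothetical helper for illustration.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


# The option string quoted in the post.
opts = parse_mount_options("rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192")
print(opts["hard"], opts["intr"], opts["rsize"], opts["wsize"])
```

Running the same helper over the options column of /proc/mounts on a node would show which options are actually in effect, which may differ from what was requested.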
ACPI is installed to overcome some APM issues w/ the poweroff command on
SMP machines.
But some nodes occasionally hang for unknown reasons. They don't crash
though (they would reboot anyway: cat /proc/sys/kernel/panic -> 0 ).
There is no way to connect to them.
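One detail worth double-checking here is the value of /proc/sys/kernel/panic. It is the number of seconds the kernel waits before rebooting after a panic, and a value of 0 means it does not reboot automatically. The sketch below only interprets that value; the function names are hypothetical, and the handling of negative values is left open because it depends on the kernel version.

```python
def describe_panic_timeout(raw_value):
    """Interpret the contents of /proc/sys/kernel/panic (hypothetical helper)."""
    seconds = int(raw_value.strip())
    if seconds == 0:
        return "no automatic reboot after a panic"
    if seconds > 0:
        return f"reboot {seconds} seconds after a panic"
    # Assumption: negative values are not interpreted here.
    return "negative value: behaviour depends on the kernel version"


def read_panic_timeout(path="/proc/sys/kernel/panic"):
    """Read the live setting on a Linux node and describe it."""
    with open(path) as handle:
        return describe_panic_timeout(handle.read())


# The value reported in the post.
print(describe_panic_timeout("0\n"))
```

With a setting of 0, a panicked node would stay down rather than reboot, so a panic cannot be ruled out from the absence of a reboot alone.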
I installed a serial console on some nodes (cf. my previous email about
remote serial console). When I connect thru the serial console to a hung
node, I can't even reboot the node, BUT minicom shows that the machine is
ONLINE.
It happens most of the time when MPI programs establish communications
between nodes.
What's going on? NFS hangs (but there is nothing in /var/log/messages or
the other logs)? ACPI problem? Does the console die? Switch issues?
Any ideas?
Thanks
JC
