Louis J. Romero louisr at aspsys.com
Thu Aug 1 07:49:04 PDT 2002


I have two suggestions that might help.

The first would be to put a head on the machine so that you are not blind 
when the system hangs.  If that is not an option, set up a cron job that 
runs maybe every 5 minutes (you'll have to tune the interval so that it gives 
you info at or around the time of the hang).  I would suggest that you dump 
a long listing of the process table, e.g. ps -wef --forest, socket info via 
netstat -a or socklist, NFS data using nfsstat, disk stats using iostat, and 
swap using free.  Mounted file systems can be viewed using df (note: put this 
command in last, because if NFS is the culprit the df command will hang).  
If you happen to know the process that is hanging, run it under strace with 
the output going to a file.  That may slow things down a bit, but you're in 
triage at this point.
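
The cron job described above could look roughly like the sketch below.  It is 
only an illustration, not a tested recipe for this cluster: the script name, 
the SNAP_DIR location, and the log file naming are assumptions, nfsstat is 
assumed to be the tool for the NFS counters, and netstat is run with -n added 
so that it does not wait on name lookups.  Tools that are not installed are 
skipped, and df is deliberately run last.

```shell
#!/bin/sh
# Hypothetical snapshot script: SNAP_DIR and the file naming are assumptions.
SNAP_DIR="${SNAP_DIR:-/tmp/hang-snapshots}"
mkdir -p "$SNAP_DIR"
LOG="$SNAP_DIR/snapshot-$(date +%Y%m%d-%H%M%S).log"

# Run one command, label its output, and skip tools that are not installed.
snap() {
    echo "=== $* ===" >> "$LOG"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$LOG" 2>&1 || echo "(exit status $?)" >> "$LOG"
    else
        echo "(not installed: $1)" >> "$LOG"
    fi
}

snap date
snap ps -wef --forest   # long process listing with parent/child tree
snap netstat -an        # sockets; -n avoids blocking on name lookups
snap nfsstat            # NFS/RPC counters (assumed tool for "nfs data")
snap iostat             # disk statistics
snap free               # memory and swap
snap df                 # last on purpose: may hang if NFS is the culprit
```

A crontab entry such as "*/5 * * * * /usr/local/sbin/hang-snapshot.sh" (the 
path is hypothetical) would run it every 5 minutes.  Pointing SNAP_DIR at a 
local disk rather than an NFS mount keeps the logs readable after a hang.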

As an aside, why are the NFS mount points hard?  NFS problems with a hard 
mount option can cause a machine to hang.  Depending upon the load that the 
clients are putting on the server, increasing the number of NFS daemons may 
relieve a bottleneck there.  Conversely, too many can cause performance 
degradation.

Good luck...


On Wednesday 31 July 2002 02:25 pm, Jean-Christophe Ducom wrote:
> The nodes of our cluster are:
> Dell Workstation Dual Xeon 1.7GHz 1GB RAM, RedHat 7.2 running 2.4.18
> patched for IRQ balancing, Syskonnect SK9D21 GigEthernet
> The cluster is heavily used for mpi programs using MPICH 1.2.4
> Each node mounts NFS directories w/ the following options:
> rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192
> ACPI is installed to overcome some APM issues w/ the poweroff command on
> SMP machines.
> But some nodes hang sometimes for unknown reasons. They don't crash
> though (they would reboot anyway: cat /proc/sys/kernel/panic  -> 0 ).
> There is no way to connect to them.
> I installed serial console on some nodes (cf. my previous email about
> remote serial console). When I connect thru the serial console to a hung
> node, I even can't reboot the node BUT minicom shows that the machine is
> It happens most of the time when MPI programs establish communications
> between nodes.
> What's going on? NFS hangs (but nothing in the /var/log/message and
> other)? ACPI problem? Does the console die? Switch issues?
> Any ideas?
> Thanks
>                 JC

Louis J. Romero
Email: louisr at aspsys.com
Local: (303) 431-4606

Aspen Systems, Inc.
3900 Youngfield Street
Wheat Ridge, Co 80033
Toll Free: (800) 992-9242
Fax: (303) 431-7196
URL: http://www.aspsys.com
