[Beowulf] NFSv3 client hangs - tcp v/s udp.

Wed May 3 15:21:07 PDT 2006

Cluster Details:
================
o  648 single processor Intel P4 worker nodes.
o  single head-node, NFSv3 server
o  OS - RedHat EL 4, kernel 2.6.12
o  Torque 2
o  Maui 3.2
o  all worker nodes NFS mount /home, /usr/local

After upgrading from Red Hat 7.1 to Red Hat EL 4, we found that 1 in 10
user jobs failed because an NFS mount point on a worker node stopped
responding. The NFS mount points on the worker nodes became unresponsive
during heavy NFS I/O. A simple "netstat -t" on the head-node showed
thousands of open TCP NFS sockets. Worker nodes with frozen NFS mount
points reported the following error message:

nfs_statfs: error no = 512

Error 512 is ERESTARTSYS, a kernel-internal code. It should be handled in
kernel space, but somehow it was being reported to user space. The kernel
should have handled the NFS timeout and reconnected transparently to the
user. We realized that the NFSv3 client defaults to TCP if no transport is
specified at mount time. The only practical fix for a worker node with a
frozen NFS mount point was to reboot the node. A "remount" works, but you
first need to stop all services using the NFS mount points.
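
For anyone who wants to watch for the same socket buildup on their own
head-node, here is a minimal sketch that tallies NFS TCP sockets by state
from "netstat -tn" output. It assumes the server listens on the standard
NFS port 2049 and that netstat prints the usual six-column layout; the
function names are made up for illustration and are not part of any tool
we used.

```python
import subprocess
from collections import Counter

NFS_PORT = "2049"  # assumption: NFS server on the standard port


def count_nfs_sockets(netstat_output):
    """Tally TCP sockets that involve the NFS port, grouped by state."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp recv-q send-q local:port remote:port STATE
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the banner and column-header lines
        local_port = fields[3].rsplit(":", 1)[-1]
        remote_port = fields[4].rsplit(":", 1)[-1]
        if NFS_PORT in (local_port, remote_port):
            states[fields[5]] += 1
    return states


def netstat_tn():
    """Return the text printed by 'netstat -tn' (numeric TCP sockets)."""
    return subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True
    ).stdout
```

Calling count_nfs_sockets(netstat_tn()) on the head-node returns a counter
keyed by socket state, so a sudden jump in the totals is easy to spot from
a cron job or a monitoring script.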

We recently switched all our NFS mounts to UDP and since then have had no
worker nodes with failing or unresponsive NFS mount points.
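
To confirm which transport each mount actually ended up with, something
like the following sketch could be run on a worker node. It reads text in
the /proc/mounts layout and reports "tcp" or "udp" per NFS mount point.
The exact spelling of the transport option varies between kernel versions,
so the code accepts both a bare "tcp"/"udp" flag and a "proto=" option;
that handling is an assumption, as is the function name.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to 'tcp', 'udp', or 'unknown'."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts rows: device mountpoint fstype options dump pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = "unknown"
        for opt in fields[3].split(","):
            if opt in ("tcp", "udp"):
                transport = opt  # bare flag form
            elif opt.startswith("proto="):
                transport = opt.split("=", 1)[1]  # key=value form
        result[fields[1]] = transport
    return result


def read_proc_mounts(path="/proc/mounts"):
    """Return the contents of the kernel's mount table."""
    with open(path) as handle:
        return handle.read()
```

nfs_transports(read_proc_mounts()) then gives a dictionary such as
{"/home": "udp"}, which makes it easy to spot a node that was missed
during the switch.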

Thought we would share this bit of experience with the list. Interestingly,
while googling we did not find much chatter about this issue.

- Amitoj.