[Beowulf] NFSv3 client hangs - tcp v/s udp.

Amrik Singh asingh at ideeinc.com
Thu May 4 09:57:32 PDT 2006


Have you enabled jumbo frames on your network? The NFS man page
has a big warning against using NFS over UDP.

Amrik 



Amitoj G. Singh wrote:

>Cluster Details:
>================
>o  648 single processor Intel P4 worker nodes.
>o  single head-node, NFSv3 server
>o  OS - RedHat EL 4, kernel 2.6.12
>o  Torque 2
>o  Maui 3.2
>o  all worker nodes NFS mount /home, /usr/local
>
>After upgrading from Red Hat 7.1 to Red Hat EL 4 we realized that 1 in 10
>user jobs was failing because an NFS mount point on a worker node had
>stopped responding. The NFS mount points on the worker nodes would become
>unresponsive during heavy NFS I/O. A simple "netstat -t" on the
>head-node showed that there were thousands of open TCP nfs sockets on the
>head-node. Worker nodes that had frozen NFS mount points responded with
>the following error message:
>
>nfs_statfs: error no = 512
>
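
For anyone wanting to put a number on the "thousands of open TCP nfs
sockets" observation, here is a rough sketch that tallies NFS sockets by
state. It assumes numeric output from "netstat -tn" (plain "netstat -t"
prints the service name "nfs" instead of the port) and that the server
listens on the standard NFS port 2049. The function name
count_nfs_sockets is made up for this example.

```python
from collections import Counter

NFS_PORT = 2049  # assumption: the server uses the standard NFS port


def count_nfs_sockets(netstat_output, port=NFS_PORT):
    """Count TCP sockets whose local port is `port`, grouped by state.

    `netstat_output` is the text printed by `netstat -tn` on the server.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP lines
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            states[fields[5]] += 1
    return states
```

Feeding it the saved output of "netstat -tn" from the head-node gives a
per-state count (ESTABLISHED, TIME_WAIT, and so on), which is easier to
compare before and after a change than eyeballing the raw listing.
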
>The above error message should be handled in kernel space but somehow was
>being reported in user space. The kernel should have handled the
>NFS timeout and reconnected transparently to the user. We realized that
>NFSv3 defaults to TCP if no transport is specified at mount time. The only
>solution for a worker node with a frozen NFS mount point was to reboot the
>node. A "remount" works but you need to stop all services using the NFS
>mount points.
>
>We recently switched all our NFS mounts to use UDP and have had no worker
>nodes with failing or unresponsive NFS mount points.
>
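
Since the transport depends on what was (or was not) given at mount
time, it may be worth confirming what each worker node actually ended up
with. The sketch below parses text in the /proc/mounts format and reports
the transport per NFS mount. It assumes the kernel lists the transport
either as "proto=tcp"/"proto=udp" or as a bare "tcp"/"udp" option; which
form appears depends on the kernel version, so both are handled. The
function name nfs_transports is made up for this example.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to 'tcp', 'udp', or 'unknown'.

    `mounts_text` is text in /proc/mounts format:
    device, mount point, fs type, options, dump, pass.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue  # not an NFS mount
        transport = "unknown"
        for opt in fields[3].split(","):
            if opt in ("tcp", "udp"):
                # assumption: some kernels show the bare option
                transport = opt
            elif opt.startswith("proto="):
                transport = opt.split("=", 1)[1]
        result[fields[1]] = transport
    return result
```

Calling nfs_transports(open("/proc/mounts").read()) on a worker node
would show at a glance whether /home and /usr/local are on TCP or UDP.
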
>Thought I would share this bit of experience with the list. Interestingly,
>while googling we did not find a lot of chatter about this issue.
>
>- Amitoj.