[Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal pbisbal at pppl.gov
Wed Apr 19 09:09:01 PDT 2017


I've been trying to troubleshoot a problem for the past two weeks with 
no luck. We have a cluster here that runs only one application (although 
the details of that application change significantly from run to run). 
Each node in the cluster has an NFS export, /local, that can be 
automounted by every other node in the cluster as /l/hostname.

Starting about two weeks ago, when jobs would try to access /l/hostname, 
they would get permission denied messages. I tried analyzing this 
problem by turning on all NFS/RPC logging with rpcdebug and also using 
tcpdump while trying to manually mount one of the remote systems. Both 
approaches indicated that stale file handles were preventing the share 
from being mounted.
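
Since the jobs report "permission denied" while the traces point to stale 
file handles, it may help to record the raw errno each client sees for 
each /l/hostname path. What follows is only a minimal sketch, not a tested 
diagnostic: classify_path is a hypothetical helper name, the hostnames are 
assumed to be passed on the command line, and it assumes Python 3 on 
Linux, where errno.ESTALE is defined.

```python
import errno
import os
import sys


def classify_path(path):
    """Return a short label for what os.stat() reports on path."""
    try:
        # On an automounted path, stat() also triggers the mount attempt.
        os.stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "denied"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Anything else: report the symbolic errno name if there is one.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Assumption: hostnames are given as arguments, and /l/<hostname> is
    # the automount path described above.
    for host in sys.argv[1:]:
        path = os.path.join("/l", host)
        print(f"{path}: {classify_path(path)}")
```

Run from cron on every node with the full host list, the timestamped 
output could show which exports go stale and when, which might then be 
lined up against job start times or restarts on the exporting node.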

Since it has been 6-8 weeks since there were any seemingly relevant 
system config changes, I suspect it's an application problem 
(naturally). On the other hand, the application developers/users insist 
that they haven't made any changes to their code, either. To be honest, 
there's no significant evidence indicating either is at fault. Any 
suggestions on how to debug this and definitively find the root cause of 
these stale file handles?


