[Beowulf] Troubleshooting NFS stale file handles
pbisbal at pppl.gov
Wed Apr 19 09:09:01 PDT 2017
I've been trying to troubleshoot a problem for the past two weeks with
no luck. We have a cluster here that runs only one application (although
the details of that application change significantly from run-to-run.).
Each node in the cluster has an NFS export, /local, that can be
automounted by every other node in the cluster as /l/hostname.
Starting about two weeks ago, when jobs would try to access /l/hostname,
they would get permission denied messages. I tried analyzing this
problem by turning on all NFS/RPC logging with rpcdebug and also using
tcpdump while trying to manually mount one of the remote systems. Both
approaches indicated state file handles were prevent the share from
Since it has been 6-8 weeks since there were any seemingly relevant
system config changes, I suspect it's an application problem
(naturally). On the other hand, the application developers/users insist
that they haven't made any changes, to their code, either. To be honest,
there's no significant evidence indicating either is at fault. Any
suggestions on how to debug this and definitively find the root cause of
these stale file handles?
More information about the Beowulf