[Beowulf] Troubleshooting NFS stale file handles

Thu Apr 20 02:14:24 PDT 2017

On Wed, Apr 19, 2017 at 8:34 PM, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> My setup isn't nearly that complicated. Every node in this cluster has a
> /local directory that is shared out to the other nodes in the cluster. The
> other nodes automount this remote directory as /l/hostname, where
> "hostname" is the name of the owner of the filesystem. For example,
> hostB will mount hostA:/local as /l/hostA.

Some more questions to provide a better picture:
- at the time the error message appears, are several clients (like
hostB) mounting the same export from hostA? If so, do they all
experience the error condition?
- is the one application the only way to trigger the error message?
Or are you able (as root or as the user running the application) to
reproduce the problem with simple tools like ls and cat? If not, what
is the output from those tools when the problem appears?
- do you use Kerberos or some similar mechanism where access is
limited in time (for Kerberos, by the lifetime of the ticket)?
- have you tried to fix it on the client side instead of restarting
nfsd on the server side, e.g. by restarting autofs, forcing a manual
unmount followed by a mount, etc.?
- do you have logs of the autofs activity, and can you check which
remote filesystems are mounted (or not...) when the error condition
appears?
- is the one application run by a single user? If so, the error
message can only mean access to system files. Does the error always
occur at the same place in the application? Do you have the source
code of the application, so that you could add code to better
describe the failure conditions?
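
For the ls/cat and "what is mounted" questions above, something like
the following could be run on a client when the error shows up. It is
only a rough sketch in Python, not anything taken from your setup: the
function names probe and nfs_mounts are made up for illustration, and
it assumes a Linux client where /proc/mounts is readable.

```python
import errno
import os
import stat


def nfs_mounts(mounts_text):
    """Return (source, mountpoint, fstype) for each NFS entry in
    /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: source, mountpoint, fstype, options, dump, pass.
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def probe(path):
    """Roughly what ls and cat do: stat the path, then list it (if a
    directory) or read one byte (otherwise). Returns (ok, message)."""
    try:
        st = os.stat(path)
        if stat.S_ISDIR(st.st_mode):
            os.listdir(path)
        else:
            with open(path, "rb") as f:
                f.read(1)
    except OSError as e:
        # Report the symbolic errno so that ESTALE can be told apart
        # from EACCES, ENOENT, etc.
        name = errno.errorcode.get(e.errno, "UNKNOWN")
        return False, f"{path}: {name} ({e.strerror})"
    return True, f"{path}: ok"


if __name__ == "__main__":
    # Note: probing a hard-mounted export whose server is unreachable
    # may block, just like ls would.
    with open("/proc/mounts") as f:
        for source, mountpoint, fstype in nfs_mounts(f.read()):
            ok, message = probe(mountpoint)
            print(source, fstype, message)
```

A stale handle should show up as ESTALE in the output, while EACCES
or ENOENT would rather point to permissions or to a mount that is not
there.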

Cheers,
Bogdan