[Beowulf] Troubleshooting NFS stale file handles
bs_lists at aakef.fastmail.fm
Wed Apr 19 14:52:25 PDT 2017
On 04/19/2017 07:58 PM, Prentice Bisbal wrote:
> Here's the sequence of events:
> 1. First job(s) run fine on the node and complete without error.
> 2. Eventually a job fails with a 'permission denied' error when it tries
> to access /l/hostname.
So you don't get ESTALE, but you get EACCES? You *might* be able to fix
this by setting the 'no_subtree_check' option in your /etc/exports. I don't
remember the details exactly anymore, but without this option nfsd/exportfs
check more intensively whether a dentry is valid.
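To see quickly which exports do not set the option explicitly, something
like the following rough sketch might help. It is only an illustration: the
export path and client range in the sample are hypothetical, and the parser
ignores continuation lines, quoted paths and default option lists. Also note
that, according to exports(5), nfs-utils 1.1.0 and later default to
no_subtree_check, so a flagged entry only matters if subtree checking is
really in effect on your server.

```python
import re

def exports_missing_no_subtree_check(text):
    """Return (path, client) pairs whose options lack 'no_subtree_check'."""
    missing = []
    for raw in text.splitlines():
        # Drop comments and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        path, clients = fields[0], fields[1:]
        for client in clients:
            # Options, if any, are the comma-separated list in parentheses.
            m = re.search(r"\(([^)]*)\)", client)
            opts = m.group(1).split(",") if m else []
            if "no_subtree_check" not in opts:
                missing.append((path, client))
    return missing

# Hypothetical exports content, for illustration only.
sample = """\
# example entries, not taken from a real server
/export/scratch  10.0.0.0/24(rw,sync)
/export/home     10.0.0.0/24(rw,sync,no_subtree_check)
"""

print(exports_missing_no_subtree_check(sample))
```

On a real server you would pass in the contents of /etc/exports instead of
the sample string, and run 'exportfs -ra' after changing the file.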
I don't think that networking can be the cause of this, but if a
dentry/inode is evicted from the server-side cache, the NFS file handle
has to be used to recreate the inode and dentry on the server side from
the underlying file system. I think EACCES is then returned if something
goes wrong connecting the dentry to the parent dentry (I need to look up
the exact details again, it's been a while since I had to deal with this).
You could try setting /proc/sys/vm/vfs_cache_pressure to a very low value
(don't set it to 0, though). Depending on your file system and kernel
version, this might help to keep dentries/inodes in the cache and so avoid
running into this (there was a bug until 3.10 that prevented this from
working properly; I'm not sure whether the related patch series has been
backported into vendor kernels).
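A minimal sketch of reading and lowering the value, assuming it is run as
root on the NFS server. The helper names are made up for illustration, and
the value 10 in the usage note is an arbitrary example, not a tested
recommendation (the kernel default is 100).

```python
from pathlib import Path

VFS_CACHE_PRESSURE = Path("/proc/sys/vm/vfs_cache_pressure")

def read_pressure(path=VFS_CACHE_PRESSURE):
    """Return the current vfs_cache_pressure as an integer."""
    return int(Path(path).read_text().strip())

def set_pressure(value, path=VFS_CACHE_PRESSURE):
    """Write a new vfs_cache_pressure and return the value read back."""
    # 0 would tell the kernel never to reclaim dentries/inodes at all,
    # which risks running the server out of memory, so refuse it here.
    if value <= 0:
        raise ValueError("refusing to set vfs_cache_pressure to 0 or below")
    Path(path).write_text(f"{value}\n")
    return read_pressure(path)
```

For example, set_pressure(10) would lower the value on the running system.
A change made this way is lost on reboot; to keep it, set
vm.vfs_cache_pressure in /etc/sysctl.conf.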
Btw, which kernel version and file system is your NFS server running on?