[Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal pbisbal at pppl.gov
Thu Apr 20 14:14:21 PDT 2017


On 04/19/2017 05:52 PM, Bernd Schubert wrote:

>
> On 04/19/2017 07:58 PM, Prentice Bisbal wrote:
>> Here's the sequence of events:
>>
>> 1. First job(s) run fine on the node and complete without error.
>>
>> 2. Eventually a job fails with a 'permission denied' error when it tries
>> to access /l/hostname.
> So you don't get ESTALE, but you get EACCES? You *might* be able to fix
> this by setting the 'no_subtree_check' option in your /etc/exports. I
> don't remember the details exactly anymore, but nfsd/exportfs check more
> intensively whether a dentry is valid if this option is not given.
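
For reference, a minimal sketch of checking /etc/exports for that
option. The helper name exports_without_no_subtree_check and the sample
export entry are made up for illustration; exportfs -ra and exportfs -v
are the usual nfs-utils commands for re-exporting and for listing the
options that are actually in effect.

```shell
# Hypothetical export entry (path and client spec are placeholders):
#   /l/hostname  192.168.0.0/24(rw,sync,no_subtree_check)

# Print every export line that does not name no_subtree_check explicitly.
# Comment lines and blank lines are skipped.
exports_without_no_subtree_check() {
    # $1: exports file to inspect (defaults to /etc/exports)
    awk '!/^[ \t]*(#|$)/ && !/no_subtree_check/' "${1:-/etc/exports}"
}

# After editing /etc/exports, as root:
#   exportfs -ra    # re-export everything listed in /etc/exports
#   exportfs -v     # show the active exports with their options
```

Running exports_without_no_subtree_check with no argument reads
/etc/exports; any line it prints is a candidate for adding the option.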

I don't remember seeing either ESTALE or EACCES, just that there was a
message about stale file handles. I didn't save the messages I captured
with tcpdump, and I had to delete my /var/log/messages files because
when I turned on all the logging I could with rpcdebug, it filled up
/var in less than a day, and I needed to free up space. In hindsight, I
should have copied them somewhere else instead of just deleting them.
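
A rough sketch of how the logs could be kept next time. The function
name archive_logs, the destination directory, and the interface name
eth0 are assumptions for illustration; the rpcdebug and tcpdump options
in the comments are the standard ones. The function copies each
messages* file elsewhere and then truncates it in place, so the syslog
daemon keeps writing to the same open file. Lines logged between the
copy and the truncate would be lost.

```shell
# Turning the debugging on and off again, as root:
#   rpcdebug -m nfsd -s all    # set all nfsd debug flags
#   rpcdebug -m nfsd -c all    # clear them again
# Capturing NFS traffic into a ring of 10 files of about 100 MB each:
#   tcpdump -i eth0 -w /scratch/nfs.pcap -C 100 -W 10 port 2049

# Copy log files to another file system, then empty the originals.
archive_logs() {
    # $1: directory holding the logs, $2: destination directory
    mkdir -p "$2" || return 1
    for f in "$1"/messages*; do
        [ -e "$f" ] || continue            # no matching files at all
        cp -p "$f" "$2/" && : > "$f"       # truncate only if the copy worked
    done
}
```

For example, archive_logs /var/log /scratch/nfs-debug-logs would free
the space in /var while keeping a copy of the messages.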

I rebooted the systems yesterday, and the problem has gone away since 
the reboot, so I can't reproduce the problem and send you the relevant 
messages. I'm not a smart man.

>
> I don't think that networking can be a cause for this, but if a
> dentry/inode is evicted from the server side cache, the NFS file handle
> has to be used to create inode and dentry on the server side on the
> underlying file system. I think EACCES is then used if something goes
> wrong connecting the dentry to the parent-dentry (I need to look up the
> exact details again, it's been a while since I had to deal with this).
Are these meanings of EACCES and ESTALE defined in the NFS RFCs? If so,
I may need to read them.
>
> You could try to set /proc/sys/vm/vfs_cache_pressure to a very low value
> (don't set it to 0, though). Depending on your file system and kernel
> version this might help to keep dentries/inode in the cache and to avoid
> running into this (there was a bug until 3.10 that prevented this from
> working properly; I'm not sure if the related patch series has been
> backported into vendor kernels).
Thanks for the tip. I'll keep it in mind.
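
For the record, a minimal sketch of what that change might look like.
The function name set_vfs_cache_pressure and the value 10 are only
illustrations, not recommendations from this thread; the procfs path is
the one Bernd gave, and the kernel default is 100. The optional second
argument exists only so the function can be tried against a scratch
file without root.

```shell
# Set vm.vfs_cache_pressure to a low, non-zero value.
set_vfs_cache_pressure() {
    # $1: new value, $2: target file (defaults to the real procfs knob)
    value=$1
    target=${2:-/proc/sys/vm/vfs_cache_pressure}
    case $value in
        ''|*[!0-9]*)
            echo "not a non-negative integer: $value" >&2
            return 1 ;;
    esac
    # At 0 the kernel never reclaims dentries and inodes under memory
    # pressure, which can end in out-of-memory conditions.
    if [ "$value" -eq 0 ]; then
        echo "refusing to set vfs_cache_pressure to 0" >&2
        return 1
    fi
    echo "previous value: $(cat "$target")"
    echo "$value" > "$target"
}
```

As root, set_vfs_cache_pressure 10 should have the same effect as
sysctl -w vm.vfs_cache_pressure=10. To keep the setting across reboots,
a line such as vm.vfs_cache_pressure = 10 would go in /etc/sysctl.conf.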
>
> Btw, which kernel version and file system is your nfs server running on?
Both servers and clients are running the same exact version of 
everything, since they are using the same NFS root filesystem:

$ cat /etc/redhat-release
CentOS release 6.8 (Final)

$ cat /proc/version
Linux version 2.6.32-642.11.1.el6.x86_64 
(mockbuild at c1bm.rdu2.centos.org) (gcc version 4.4.7 20120313 (Red Hat 
4.4.7-17) (GCC) ) #1 SMP Fri Nov 18 19:25:05 UTC 2016

$ rpm -qa | grep -i nfs
nfs-utils-lib-1.1.5-11.el6.x86_64
nfs-utils-1.2.3-70.el6_8.2.x86_64
nfs4-acl-tools-0.3.3-8.el6.x86_64


>
>
> Bernd


