[Beowulf] Troubleshooting NFS stale file handles
griznog at gmail.com
Wed Apr 19 11:07:27 PDT 2017
I've had far fewer unexplained NFS issues (admittedly, I only made a
limited search for the guilty) since I started setting fsid= in my NFS
exports. If you aren't setting it, it might be worth a try. NFS seems to
recover from problems much better when an fsid is assigned to the root
of each export.
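As I understand exports(5), without fsid= the server identifies the
exported filesystem by its UUID or device number, so pinning the value
explicitly keeps file handles stable if that identity shifts. An export
line with it set looks something like
/local 10.0.0.0/24(rw,sync,fsid=1), where the client range and the fsid
value are made up for illustration. To see which exports lack the
option, a rough Python sketch like the one below might do. The function
name and the sample data are hypothetical, and it handles only simple
lines (no quoted paths, and default option lists starting with "-" are
skipped rather than applied to later clients).

```python
import re


def exports_missing_fsid(exports_text):
    """Return (path, client_spec) pairs whose option list has no fsid= entry."""
    missing = []
    # Join backslash-continued lines, which exports(5) allows.
    joined = exports_text.replace("\\\n", " ")
    for raw in joined.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        path, clients = fields[0], fields[1:]
        for client in clients:
            if client.startswith("-"):
                continue  # default option list; not handled by this sketch
            match = re.match(r"^([^()]*)\((.*)\)$", client)
            options = match.group(2).split(",") if match else []
            if not any(opt.startswith("fsid=") for opt in options):
                missing.append((path, client))
    return missing


# Hypothetical exports content; paths other than /local, the address
# range, and the fsid value are assumptions for illustration only.
sample = """\
# example exports file
/local   10.0.0.0/24(rw,sync,fsid=1)
/scratch 10.0.0.0/24(rw,sync)
"""

for path, client in exports_missing_fsid(sample):
    print(f"{path} exported to {client} without fsid=")
```

On a real server you would feed it the contents of /etc/exports, add
fsid= to whatever it reports, and re-export with exportfs -ra.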
On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
> Here's the sequence of events:
> 1. First job(s) run fine on the node and complete without error.
> 2. Eventually a job fails with a 'permission denied' error when it tries
> to access /l/hostname.
> Since no jobs fail with a file I/O error, it's hard to confirm that the
> jobs themselves are causing the problem. However, if these particular
> jobs are the only thing running on the cluster and should be the only
> jobs accessing these NFS shares, what else could be causing them?
> All these systems are getting their user information from LDAP. Since
> some jobs run before these errors appear, missing or inaccurate user
> info doesn't seem to be a likely source of this problem, but I'm not
> ruling anything out at this point.
> Important detail: This is NFSv3.
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
> > Are you saying they can’t mount the filesystem, or they can’t write to a
> > mounted filesystem? Where does this system get its user information from,
> > if the latter?
> > --
> > ____
> > || \\UTGERS,
> > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
> > `'
> >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> >> Beowulfers,
> >> I've been trying to troubleshoot a problem for the past two weeks with
> >> no luck. We have a cluster here that runs only one application (although
> >> the details of that application change significantly from run to run).
> >> Each node in the cluster has an NFS export, /local, that can be automounted
> >> by every other node in the cluster as /l/hostname.
> >> Starting about two weeks ago, when jobs would try to access
> >> /l/hostname, they would get permission denied messages. I tried analyzing
> >> this problem by turning on all NFS/RPC logging with rpcdebug and also using
> >> tcpdump while trying to manually mount one of the remote systems. Both
> >> approaches indicated stale file handles were preventing the share from being
> >> mounted.
> >> Since it has been 6-8 weeks since there were any seemingly relevant
> >> system config changes, I suspect it's an application problem (naturally).
> >> On the other hand, the application developers/users insist that they
> >> haven't made any changes to their code, either. To be honest, there's no
> >> significant evidence indicating either is at fault. Any suggestions on how
> >> to debug this and definitively find the root cause of these stale file
> >> handles?
> >> --
> >> Prentice
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> >> To change your subscription (digest mode or unsubscribe) visit
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
- King Wu-Ling, ruler of the Zhao state in northern China, 307 BC