[Beowulf] Troubleshooting NFS stale file handles
Prentice Bisbal
pbisbal at pppl.gov
Wed Apr 19 11:15:02 PDT 2017
Even with NFSv3? It seems like fsid=0 is required for NFSv4, but does it
have any impact on NFSv3? I honestly am not an expert on the details of
NFS. For me, it's always "just worked", and performance was never an
issue, so I never had much reason to dig into the details of
tweaking/debugging/optimizing NFS.
Prentice
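
On the fsid question: as far as I can tell from exports(5), fsid=0 (or fsid=root) marks the NFSv4 root, while other values (a small integer or a UUID) explicitly identify a filesystem when the server can't derive a stable identifier from the device number or filesystem UUID. For anyone who wants to see which exports already carry an explicit fsid, here is a minimal Python sketch. The function name, the sample entries, the network range, and the fsid value are all made up for illustration. It assumes the simple "path client(options)" layout and does not handle quoted paths, backslash line continuations, or "-default,options" entries.

```python
import re

def exports_missing_fsid(exports_text):
    """Return (path, client) pairs whose option list has no fsid= entry."""
    missing = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            # Each client looks like host(opt1,opt2,...); options may be absent.
            m = re.fullmatch(r"([^()]*)(?:\((.*)\))?", client)
            if m is None:
                continue  # malformed entry; skipped in this sketch
            host, opts = m.group(1), (m.group(2) or "")
            if not any(o.strip().startswith("fsid=") for o in opts.split(",")):
                missing.append((path, host))
    return missing

# Hypothetical exports content, not taken from any real server.
SAMPLE = """\
# made-up entries for illustration
/local   10.0.0.0/24(rw,sync,no_subtree_check)
/scratch 10.0.0.0/24(rw,sync,fsid=101,no_subtree_check)
"""

print(exports_missing_fsid(SAMPLE))
```

On a real server you would pass in the contents of /etc/exports instead of SAMPLE. Whether adding fsid= actually helps with NFSv3 stale handles in a given setup is something to test, not something this script can tell you.
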
On 04/19/2017 02:07 PM, John Hanks wrote:
> I've had far fewer unexplained (although admittedly there was a
> limited search for the guilty) NFS issues since I started using fsid=
> in my NFS exports. If you aren't setting that it might be worth a try.
> NFS seems to be much better at recovering from problems with an fsid
> assigned to the root of exports.
>
> jbh
>
> On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
> Here's the sequence of events:
>
> 1. First job(s) run fine on the node and complete without error.
>
> 2. Eventually a job fails with a 'permission denied' error when it
> tries to access /l/hostname.
>
> Since no jobs fail with a file I/O error, it's hard to confirm that
> the jobs themselves are causing the problem. However, if these
> particular jobs are the only thing running on the cluster and should
> be the only jobs accessing these NFS shares, what else could be
> causing the errors?
>
> All these systems are getting their user information from LDAP. Since
> some jobs run before these errors appear, missing or inaccurate user
> info doesn't seem to be a likely source of this problem, but I'm not
> ruling anything out at this point.
>
> Important detail: This is NFSv3.
>
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>
> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
> > Are you saying they can’t mount the filesystem, or they can’t
> > write to a mounted filesystem? Where does this system get its user
> > information from, if the latter?
> >
> > --
> > Ryan Novosielski - novosirj at rutgers.edu
> > Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > Office of Advanced Research Computing - MSB C630, Newark
> > Rutgers, the State University of NJ
> >
> >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> >>
> >> Beowulfers,
> >>
> >> I've been trying to troubleshoot a problem for the past two
> >> weeks with no luck. We have a cluster here that runs only one
> >> application (although the details of that application change
> >> significantly from run to run). Each node in the cluster has an
> >> NFS export, /local, that can be automounted by every other node in
> >> the cluster as /l/hostname.
> >>
> >> Starting about two weeks ago, when jobs would try to access
> >> /l/hostname, they would get permission denied messages. I tried
> >> analyzing this problem by turning on all NFS/RPC logging with
> >> rpcdebug and also using tcpdump while trying to manually mount one
> >> of the remote systems. Both approaches indicated stale file
> >> handles were preventing the share from being mounted.
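
Since the jobs report "permission denied" while the mount-level debugging points at stale file handles, it may help to record which errno a client actually gets back when it touches the path. Below is a minimal Python sketch of such a probe. The function name and the result labels are made up for illustration. What errno the automounter surfaces when the mount itself fails is not something I have verified, so treat the categories as a starting point.

```python
import errno
import os

def classify_access(path):
    """Stat and list `path`; report which kind of error, if any, came back."""
    try:
        os.stat(path)
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"    # stale NFS file handle
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "denied"   # permission problem
        if exc.errno == errno.ENOENT:
            return "missing"  # path does not exist (or was not mounted)
        # Anything else: report the symbolic errno name when one is known.
        return "other:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"
```

Calling classify_access("/l/hostname") from a job prologue on each node, and logging the result with a timestamp, would show whether a node sees a stale handle before the first job reports a permission error.
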
> >>
> >> Since it has been 6-8 weeks since there were any seemingly
> >> relevant system config changes, I suspect it's an application
> >> problem (naturally). On the other hand, the application
> >> developers/users insist that they haven't made any changes to
> >> their code, either. To be honest, there's no significant evidence
> >> indicating either is at fault. Any suggestions on how to debug
> >> this and definitively find the root cause of these stale file handles?
> >>
> >> --
> >> Prentice
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> >> To change your subscription (digest mode or unsubscribe) visit
> >> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC