[Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal pbisbal at pppl.gov
Wed Apr 19 11:15:02 PDT 2017


Even with NFSv3? It seems like fsid=0 is required for NFSv4, but does it 
have any impact on NFSv3? I'm honestly not an expert on the details of 
NFS. For me, it has always "just worked", and performance was never an 
issue, so I never had much reason to dig into the details of 
tweaking/debugging/optimizing NFS.
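
For anyone who wants to check whether their exports set fsid= at all,
here is a minimal Python sketch. It assumes the usual /etc/exports
layout (a path followed by client(options) entries) and ignores line
continuations, quoted paths and default "-option" lists. The function
name exports_missing_fsid and the sample entries are made up for
illustration. It only reports entries that lack an fsid= option; it
says nothing about whether adding one cures stale file handles.

```python
import re


def exports_missing_fsid(exports_text):
    """Return (path, client) pairs whose option list has no fsid= entry.

    Simplified, hypothetical helper: it does not handle line
    continuations, quoted paths, or default "-option" lists.
    """
    missing = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        path, *clients = line.split()
        # Skip default option lists such as "-sync,rw" (not merged here).
        clients = [c for c in clients if not c.startswith("-")]
        if not clients:
            # A bare path is exported to any client with default options.
            missing.append((path, "(any)"))
            continue
        for client in clients:
            match = re.fullmatch(r"([^()]*)\((.*)\)", client)
            host, opts = (match.group(1), match.group(2)) if match else (client, "")
            options = [o.strip() for o in opts.split(",") if o.strip()]
            if not any(o.startswith("fsid=") for o in options):
                missing.append((path, host or "(any)"))
    return missing


# Hypothetical exports content; the fsid value 101 is arbitrary.
sample = """
# exports on a compute node
/local    *(rw,sync,no_subtree_check)
/scratch  10.0.0.0/24(rw,sync,fsid=101)
"""
print(exports_missing_fsid(sample))  # [('/local', '*')]
```

Running it against the text of a node's real exports file would list
the (path, client) pairs to look at first.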

Prentice

On 04/19/2017 02:07 PM, John Hanks wrote:
> I've had far fewer unexplained (although admittedly there was a 
> limited search for the guilty) NFS issues since I started using fsid= 
> in my NFS exports. If you aren't setting that it might be worth a try. 
> NFS seems to be much better at recovering from problems with an fsid 
> assigned to the root of exports.
>
> jbh
>
> On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbisbal at pppl.gov>
> wrote:
>
>     Here's the sequence of events:
>
>     1. First job(s) run fine on the node and complete without error.
>
>     2. Eventually a job fails with a 'permission denied' error when it
>     tries to access /l/hostname.
>
>     Since no jobs fail with a file I/O error, it's hard to confirm that
>     the jobs themselves are causing the problem. However, if these
>     particular jobs are the only thing running on the cluster and should
>     be the only jobs accessing these NFS shares, what else could be
>     causing the errors?
>
>     All these systems are getting their user information from LDAP. Since
>     some jobs run before these errors appear, missing or inaccurate user
>     info doesn't seem to be a likely source of this problem, but I'm not
>     ruling anything out at this point.
>
>     Important detail: This is NFSv3.
>
>     Prentice Bisbal
>     Lead Software Engineer
>     Princeton Plasma Physics Laboratory
>     http://www.pppl.gov
>
>     On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>     > Are you saying they can’t mount the filesystem, or they can’t
>     > write to a mounted filesystem? Where does this system get its user
>     > information from, if the latter?
>     >
>     > --
>     > ____
>     > || \\UTGERS,
>      |---------------------------*O*---------------------------
>     > ||_// the State        |         Ryan Novosielski -
>     novosirj at rutgers.edu <mailto:novosirj at rutgers.edu>
>     > || \\ University | Sr. Technologist - 973/972.0922
>     <tel:%28973%29%20972-0922> (2x0922) ~*~ RBHS Campus
>     > ||  \\    of NJ        | Office of Advanced Research Computing -
>     MSB C630, Newark
>     >       `'
>     >
>     >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov>
>     >> wrote:
>     >>
>     >> Beowulfers,
>     >>
>     >> I've been trying to troubleshoot a problem for the past two
>     >> weeks with no luck. We have a cluster here that runs only one
>     >> application (although the details of that application change
>     >> significantly from run to run). Each node in the cluster has an
>     >> NFS export, /local, that can be automounted by every other node
>     >> in the cluster as /l/hostname.
>     >>
>     >> Starting about two weeks ago, when jobs would try to access
>     >> /l/hostname, they would get permission denied messages. I tried
>     >> analyzing this problem by turning on all NFS/RPC logging with
>     >> rpcdebug and also using tcpdump while trying to manually mount
>     >> one of the remote systems. Both approaches indicated that stale
>     >> file handles were preventing the share from being mounted.
>     >>
>     >> Since it has been 6-8 weeks since there were any seemingly
>     >> relevant system config changes, I suspect it's an application
>     >> problem (naturally). On the other hand, the application
>     >> developers/users insist that they haven't made any changes to
>     >> their code, either. To be honest, there's no significant evidence
>     >> indicating either is at fault. Any suggestions on how to debug
>     >> this and definitively find the root cause of these stale file
>     >> handles?
>     >>
>     >> --
>     >> Prentice
>     >> _______________________________________________
>     >> Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     >> To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
>
> -- 
> ‘[A] talent for following the ways of yesterday, is not sufficient to 
> improve the world of today.’
>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
