[Beowulf] Troubleshooting NFS stale file handles

John Hanks griznog at gmail.com
Wed Apr 19 11:41:59 PDT 2017


I do this for NFSv3 and NFSv4, but all my underlying filesystems are ZFS
and that was what prompted me to begin setting fsid initially. It may be
irrelevant for NFSv3 and/or non-ZFS filesystems.
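
For anyone who wants to try the fsid approach, here is a minimal Python
sketch (not taken from any real configuration in this thread) that builds
/etc/exports lines with an explicit fsid= on each export. The function name
exports_lines, the client range 10.0.0.0/24 and the rw,sync options are
assumptions made up for illustration; only /local comes from the thread.
The fsid= option itself is described in exports(5). The numbers are passed
in by hand rather than generated, on the assumption that renumbering an
existing export would invalidate the handles clients already hold.

```python
def exports_lines(fsid_by_path, clients, options=("rw", "sync")):
    """Build /etc/exports lines with an explicit fsid for every export.

    fsid_by_path maps an export path to the integer fsid chosen for it.
    The caller picks the numbers so they stay stable across edits.
    """
    fsids = list(fsid_by_path.values())
    if len(set(fsids)) != len(fsids):
        # Two exports on one server must not share a filesystem ID.
        raise ValueError("each export needs its own fsid")

    lines = []
    for path, fsid in fsid_by_path.items():
        opts = ",".join(list(options) + ["fsid={}".format(fsid)])
        lines.append("{} {}({})".format(path, clients, opts))
    return lines


if __name__ == "__main__":
    # /local is the export named in the thread; the client range is hypothetical.
    for line in exports_lines({"/local": 1}, "10.0.0.0/24"):
        print(line)
```

Run as a script, this prints "/local 10.0.0.0/24(rw,sync,fsid=1)". After
editing /etc/exports, "exportfs -ra" makes the server re-read it.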

jbh

On Wed, Apr 19, 2017 at 9:13 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:

> Even with NFSv3? It seems like fsid=0 is required for NFSv4, but does it
> have any impact on NFSv3? I honestly am not an expert on the details of
> NFS. For me, it's always "just worked", and performance was never an
> issue, so I never had much reason to dig into the details of
> tweaking/debugging/optimizing NFS.
>
> Prentice
>
> On 04/19/2017 02:07 PM, John Hanks wrote:
>
> I've had far fewer unexplained (although admittedly there was a limited
> search for the guilty) NFS issues since I started using fsid= in my NFS
> exports. If you aren't setting that it might be worth a try. NFS seems to
> be much better at recovering from problems with an fsid assigned to the
> root of exports.
>
> jbh
>
> On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>> Here's the sequence of events:
>>
>> 1. First job(s) run fine on the node and complete without error.
>>
>> 2. Eventually a job fails with a 'permission denied' error when it tries
>> to access /l/hostname.
>>
>> Since no jobs fail with a file I/O error, it's hard to confirm that the
>> jobs themselves are causing the problem. However, if these particular
>> jobs are the only thing running on the cluster and should be the only
>> jobs accessing these NFS shares, what else could be causing the errors?
>>
>> All these systems are getting their user information from LDAP. Since
>> some jobs run before these errors appear, missing or inaccurate user
>> info doesn't seem to be a likely source of this problem, but I'm not
>> ruling anything out at this point.
>>
>> Important detail: This is NFSv3.
>>
>> Prentice Bisbal
>> Lead Software Engineer
>> Princeton Plasma Physics Laboratory
>> http://www.pppl.gov
>>
>> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>> > Are you saying they can’t mount the filesystem, or they can’t write to
>> > a mounted filesystem? Where does this system get its user information from,
>> > if the latter?
>> >
>> > --
>> > ____
>> > || \\UTGERS,     |---------------------------*O*---------------------------
>> > ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>> >       `'
>> >
>> >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>> >>
>> >> Beowulfers,
>> >>
>> >> I've been trying to troubleshoot a problem for the past two weeks with
>> >> no luck. We have a cluster here that runs only one application (although
>> >> the details of that application change significantly from run to run).
>> >> Each node in the cluster has an NFS export, /local, that can be automounted
>> >> by every other node in the cluster as /l/hostname.
>> >>
>> >> Starting about two weeks ago, when jobs would try to access
>> >> /l/hostname, they would get permission denied messages. I tried analyzing
>> >> this problem by turning on all NFS/RPC logging with rpcdebug and also using
>> >> tcpdump while trying to manually mount one of the remote systems. Both
>> >> approaches indicated stale file handles were preventing the share from being
>> >> mounted.
>> >>
>> >> Since it has been 6-8 weeks since there were any seemingly relevant
>> >> system config changes, I suspect it's an application problem (naturally).
>> >> On the other hand, the application developers/users insist that they
>> >> haven't made any changes to their code, either. To be honest, there's no
>> >> significant evidence indicating either is at fault. Any suggestions on how
>> >> to debug this and definitively find the root cause of these stale file
>> >> handles?
>> >>
>> >> --
>> >> Prentice
>> >> _______________________________________________
>> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> >> To change your subscription (digest mode or unsubscribe) visit
>> >> http://www.beowulf.org/mailman/listinfo/beowulf
>>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>