[Beowulf] Troubleshooting NFS stale file handles
John.Hearns at xma.co.uk
Wed Apr 19 23:53:14 PDT 2017
Some excellent NFS knowledge in this thread!
I have had my share of NFS troubleshooting and tuning in the past, though can't add anything directly relevant to this problem.
I am posting to flag up the NFS Ganesha project, in case anyone has not heard of it:
And since I am a bit of a fan of Gluster:
From: Beowulf [beowulf-bounces at beowulf.org] on behalf of Bernd Schubert [bs_lists at aakef.fastmail.fm]
Sent: 19 April 2017 23:00
To: John Hanks; Prentice Bisbal; Ryan Novosielski
Cc: Beowulf List
Subject: Re: [Beowulf] Troubleshooting NFS stale file handles
fsid was basically invented to avoid ESTALE when the device numbers
changed (i.e. sda becomes sdb on next reboot or unstable device numbers
as with lvm or different devices on HA servers).
Nowadays the uuid of a file system is used by default and one typically
does not need to set this anymore for V2 and V3. For V4 it needs to be
set for the nfs root only (I think).
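
As a rough illustration of the point above, here is a minimal sketch that formats /etc/exports entries with an explicit fsid, so the file handle no longer depends on device numbers. The helper name export_line, the client spec "*", the rw/sync options and the fsid values are assumptions chosen for illustration; only the /local path comes from this thread.

```python
def export_line(path, client="*", options=("rw", "sync"), fsid=None):
    """Format one /etc/exports entry, optionally pinning an explicit fsid.

    fsid may be a small integer or a UUID string; fsid=0 conventionally
    marks the NFSv4 pseudo-root. All defaults here are illustrative.
    """
    opts = list(options)
    if fsid is not None:
        opts.append(f"fsid={fsid}")
    return f"{path} {client}({','.join(opts)})"


# Hypothetical entry for the per-node export discussed in this thread.
print(export_line("/local", fsid=1))
```

After editing /etc/exports by hand or from a generator like this, the export table still has to be reloaded on the server (typically with exportfs -ra) before the new fsid takes effect.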
On 04/19/2017 08:41 PM, John Hanks wrote:
> I do this for NFSv3 and NFSv4, but all my underlying filesystems are ZFS and
> that was what prompted me to begin setting fsid initially. It may be irrelevant
> for NFSv3 and/or non-ZFS filesystems.
> On Wed, Apr 19, 2017 at 9:13 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
> Even with NFSv3? It seems like fsid=0 is required for NFSv4, but does it
> have any impact on NFSv3? I honestly am not an expert on the details of NFS.
> For me, it's always "just worked", and performance was never an issue, so I
> never had much reason to dig into the details of
> tweaking/debugging/optimizing NFS.
> On 04/19/2017 02:07 PM, John Hanks wrote:
>> I've had far fewer unexplained (although admittedly there was a limited
>> search for the guilty) NFS issues since I started using fsid= in my NFS
>> exports. If you aren't setting that it might be worth a try. NFS seems to
>> be much better at recovering from problems with an fsid assigned to the
>> root of exports.
>> On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>> Here's the sequence of events:
>> 1. First job(s) run fine on the node and complete without error.
>> 2. Eventually a job fails with a 'permission denied' error when it tries
>> to access /l/hostname.
>> Since no jobs fail with a file I/O error, it's hard to confirm that the
>> jobs themselves are causing the problem. However, if these particular
>> jobs are the only thing running on the cluster and should be the only
>> jobs accessing these NFS shares, what else could be causing them?
>> All these systems are getting their user information from LDAP. Since
>> some jobs run before these errors appear, lack of, or inaccurate user
>> info doesn't seem to be a likely source of this problem, but I'm not
>> ruling anything out at this point.
>> Important detail: This is NFSv3.
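
One way to narrow down the symptom described above is to record the exact errno a node sees when it touches the automounted path, since "permission denied" (EACCES) and a stale handle (ESTALE) are different errors at the system-call level. The following is only a sketch: the function name probe is hypothetical, and it assumes the failure shows up on a plain directory listing of /l/hostname.

```python
import errno
import os


def probe(path):
    """Try to list a directory; return None on success, else the errno name.

    On Linux a stale NFS file handle is reported as ESTALE, while a
    permissions problem is reported as EACCES.
    """
    try:
        os.listdir(path)
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one is known.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Calling probe("/l/hostname") from a small job on each node, before and after the real application runs, would show whether the nodes see ESTALE or EACCES and roughly when the state changes.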
>> Prentice Bisbal
>> Lead Software Engineer
>> Princeton Plasma Physics Laboratory
>> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>> > Are you saying they can’t mount the filesystem, or they can’t write
>> to a mounted filesystem? Where does this system get its user
>> information from, if the latter?
>> > --
>> > ____
>> > || \\UTGERS, |---------------------------*O*---------------------------
>> > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu
>> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> > || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
>> > `'
>> >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>> >> Beowulfers,
>> >> I've been trying to troubleshoot a problem for the past two weeks
>> with no luck. We have a cluster here that runs only one application
>> (although the details of that application change significantly from
>> run to run). Each node in the cluster has an NFS export, /local, that
>> can be automounted by every other node in the cluster as /l/hostname.
>> >> Starting about two weeks ago, when jobs would try to access
>> /l/hostname, they would get permission denied messages. I tried
>> analyzing this problem by turning on all NFS/RPC logging with rpcdebug
>> and also using tcpdump while trying to manually mount one of the
>> remote systems. Both approaches indicated stale file handles were
>> preventing the share from being mounted.
>> >> Since it has been 6-8 weeks since there were any seemingly relevant
>> system config changes, I suspect it's an application problem
>> (naturally). On the other hand, the application developers/users
>> insist that they haven't made any changes to their code, either. To
>> be honest, there's no significant evidence indicating either is at
>> fault. Any suggestions on how to debug this and definitively find the
>> root cause of these stale file handles?
>> >> --
>> >> Prentice
>> >> _______________________________________________
>> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> >> To change your subscription (digest mode or unsubscribe) visit
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf