[Beowulf] Troubleshooting NFS stale file handles
pbisbal at pppl.gov
Thu Apr 20 14:18:40 PDT 2017
Thanks for the tip. I hadn't even thought of looking at SMART, although
any errors should show up in the logwatch e-mails, which I do check
every day, and haven't seen any on these systems. I also heard recently
that the smartmontools packages shipped with most Linux distros are
quite old, and that the latest version can detect many more SMART errors
than the old distro-provided versions.
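For anyone wanting to check for the Command_Timeout symptom described below, here's a rough sketch that walks the disks with smartctl and flags nonzero raw values. The /dev/sd? glob and the SMART attribute-table layout are assumptions about your setup; adjust for your controller.

```shell
# Sketch: scan each disk for a nonzero SMART Command_Timeout raw value.
# Assumptions: smartmontools is installed, disks appear as /dev/sd?,
# and "smartctl -A" prints the standard attribute table (raw value last).
for dev in /dev/sd?; do
    smartctl -A "$dev" 2>/dev/null | awk -v d="$dev" '
        $2 == "Command_Timeout" && $NF+0 > 0 {
            print d ": Command_Timeout raw value " $NF
        }'
done
```

Run as root; a nonzero raw value on one member of the array is the pattern Neil saw.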
On 04/19/2017 07:11 PM, Neil McFadyen wrote:
> I had a similar problem and it turned out to be a disk problem. SMART
> attributes showed high
> 188 Command_Timeout values for 1 of the disks in the RAID array on the
> storage server.
> The server would become inaccessible (we couldn't even ping it), with
> no errors in the server's logs. We had to reboot the server; it would
> then work for a while before failing again. Replacing the disk fixed it.
> Neil McFadyen
> Carleton University
> On 2017-04-19 1:58 PM, Prentice Bisbal wrote:
>> Here's the sequence of events:
>> 1. First job(s) run fine on the node and complete without error.
>> 2. Eventually a job fails with a 'permission denied' error when it
>> tries to access /l/hostname.
>> Since no jobs fail with a file I/O error, it's hard to confirm that
>> the jobs themselves are causing the problem. However, if these
>> particular jobs are the only thing running on the cluster and should
>> be the only jobs accessing these NFS shares, what else could be
>> causing them?
>> All these systems are getting their user information from LDAP. Since
>> some jobs run before these errors appear, lack of, or inaccurate user
>> info doesn't seem to be a likely source of this problem, but I'm not
>> ruling anything out at this point.
>> Important detail: This is NFSv3.
>> Prentice Bisbal
>> Lead Software Engineer
>> Princeton Plasma Physics Laboratory
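One NFSv3-specific angle worth checking: a v3 file handle embeds the export's filesystem id, which by default is derived from the underlying block device's major/minor numbers. If a reboot renumbers the device, or the export is recreated, every handle clients already hold goes stale. Pinning fsid= in /etc/exports is a common guard. A sketch only; the fsid value and client pattern below are illustrative assumptions, not from this thread:

```
# /etc/exports on each node (illustrative): a fixed fsid keeps NFSv3
# file handles stable across device renumbering on the server.
/local  *.pppl.gov(rw,sync,no_subtree_check,fsid=101)
```

After editing, `exportfs -ra` re-applies the export table without a restart.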
>> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>>> Are you saying they can’t mount the filesystem, or they can’t write
>>> to a mounted filesystem? Where does this system get its user
>>> information from, if the latter?
>>> || \\UTGERS, |---------------------------*O*---------------------------
>>> ||_// the State | Ryan Novosielski - novosirj at rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
>>> || \\ of NJ | Office of Advanced Research Computing - MSB
>>> C630, Newark
>>>> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>>> I've been trying to troubleshoot a problem for the past two weeks
>>>> with no luck. We have a cluster here that runs only one application
>>>> (although the details of that application change significantly from
>>>> run-to-run). Each node in the cluster has an NFS export, /local,
>>>> that can be automounted by every other node in the cluster as
>>>> /l/hostname.
>>>> Starting about two weeks ago, when jobs would try to access
>>>> /l/hostname, they would get permission denied messages. I tried
>>>> analyzing this problem by turning on all NFS/RPC logging with
>>>> rpcdebug and also using tcpdump while trying to manually mount one
>>>> of the remote systems. Both approaches indicated stale file handles
>>>> were preventing the share from being mounted.
>>>> Since it has been 6-8 weeks since there were any seemingly relevant
>>>> system config changes, I suspect it's an application problem
>>>> (naturally). On the other hand, the application developers/users
>>>> insist that they haven't made any changes to their code, either.
>>>> To be honest, there's no significant evidence indicating either is
>>>> at fault. Any suggestions on how to debug this and definitively
>>>> find the root cause of these stale file handles?
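For anyone wanting to reproduce the capture described above, a sketch of the client-side debugging session. "nodename" and the paths are placeholders, and the `command -v` guards are only there to keep the sketch safe to paste on a box missing the tools; run as root on a client that sees the error.

```shell
# Sketch: mount one export by hand with NFS/RPC debugging and a packet
# capture running. "nodename" is a placeholder for a cluster host.
server=nodename

# Kernel NFS-client and SunRPC debug messages land in syslog/dmesg.
# (Turn off later with: rpcdebug -m nfs -c all; rpcdebug -m rpc -c all)
command -v rpcdebug >/dev/null && rpcdebug -m nfs -s all
command -v rpcdebug >/dev/null && rpcdebug -m rpc -s all

# Capture the mount attempt's traffic for offline analysis.
command -v tcpdump >/dev/null && \
    tcpdump -s 0 -w /tmp/nfs-mount.pcap host "$server" and port 2049 &

# Try the mount by hand and keep any error text for comparison.
mkdir -p /tmp/nfs-debug-mnt
mount -t nfs -o vers=3 "$server":/local /tmp/nfs-debug-mnt 2>&1 \
    | tee /tmp/nfs-mount-err.log
```

An ESTALE on the MNT or FSINFO reply in the capture points at the server's handle changing; a failure later in the session points elsewhere.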
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit