[Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal pbisbal at pppl.gov
Wed Apr 19 11:11:29 PDT 2017


Ellis,

Thanks for the suggestion(s). Just this morning I started considering 
the network as a possible source of error. My stale file handle errors 
are easily fixed by just restarting the nfs servers with 'service nfs 
restart', so they aren't as severe as the ones you describe.
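
For reference, recovery is just something like the following (the export
path is a placeholder, and the client-side step is only needed if a mount
is still wedged):

    # on the NFS server that exports the share
    service nfs restart

    # on a client that still sees the stale handle: lazily unmount so
    # the automounter can re-establish the mount
    umount -l /l/hostname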

Prentice

On 04/19/2017 02:03 PM, Ellis H. Wilson III wrote:
> Here are a couple of conditions I've seen cause stale NFS file 
> handles.  These are rather high-level, just to get you 
> started.  Sorry, short on time today:
>
> 1. Are you sure your NFS server isn't getting swamped by the jobs such 
> that it drops packets back to the clients?  Completely overwhelming an 
> NFS server for sufficient lengths of time might cause this, though 
> it's rare.
>
> 2. Are you sure that your clients (and the NFS server itself) have a 
> solid network connection?  Frequent network hiccups can trigger stale 
> NFS file handles that occasionally require a hard reboot for me.  This 
> is the more common case I see.
>
> Both of these essentially relate to the same thing, which is the 
> connection between the NFS server and clients becoming stalled for too 
> long a time at some point.  In theory NFS should deal with this 
> gracefully, but there are corner-cases (that ironically get hit more 
> often than I feel like they should) where it gets stuck in a way 
> that's rather sticky and tends to require a reboot.
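>
> A few quick checks for both of these (interface name below is a 
> placeholder):
>
>     # on the clients: a climbing 'retrans' count in the RPC stats
>     # points at drops/timeouts between client and server
>     nfsstat -rc
>
>     # on the server: overall RPC call/error counts
>     nfsstat -rs
>
>     # on both ends: NIC-level drops and errors
>     ethtool -S eth0 | grep -Ei 'drop|err'
>
>     # and TCP-level retransmissions
>     netstat -s | grep -i retrans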
>
> Best,
>
> ellis
>
> On 04/19/2017 01:58 PM, Prentice Bisbal wrote:
>> Here's the sequence of events:
>>
>> 1. First job(s) run fine on the node and complete without error.
>>
>> 2. Eventually a job fails with a 'permission denied' error when it tries
>> to access /l/hostname.
>>
>> Since no jobs fail with a file I/O error, it's hard to confirm that the
>> jobs themselves are causing the problem. However, if these particular
>> jobs are the only things running on the cluster and should be the only
>> jobs accessing these NFS shares, what else could be causing the errors?
>>
>> All these systems are getting their user information from LDAP. Since
>> some jobs run before these errors appear, missing or inaccurate user
>> info doesn't seem to be a likely source of this problem, but I'm not
>> ruling anything out at this point.
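>>
>> A quick sanity check on a node would be something like this (the
>> username is a placeholder):
>>
>>     getent passwd someuser   # should return the LDAP entry
>>     id someuser              # should show the expected uid/gid and groups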
>>
>> Important detail: This is NFSv3.
>>
>> Prentice Bisbal
>> Lead Software Engineer
>> Princeton Plasma Physics Laboratory
>> http://www.pppl.gov
>>
>> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>>> Are you saying they can’t mount the filesystem, or they can’t write to
>>> a mounted filesystem? Where does this system get its user information
>>> from, if the latter?
>>>
>>> -- 
>>> ____
>>> || \\UTGERS,      |---------------------------*O*---------------------------
>>> ||_// the State   |  Ryan Novosielski - novosirj at rutgers.edu
>>> || \\ University  |  Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\    of NJ   |  Office of Advanced Research Computing - MSB C630, Newark
>>>      `'
>>>
>>>> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>>>
>>>> Beowulfers,
>>>>
>>>> I've been trying to troubleshoot a problem for the past two weeks
>>>> with no luck. We have a cluster here that runs only one application
>>>> (although the details of that application change significantly from
>>>> run to run). Each node in the cluster has an NFS export, /local,
>>>> that can be automounted by every other node in the cluster as
>>>> /l/hostname.
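>>>>
>>>> (For context, the automount side is the usual host-keyed wildcard map,
>>>> roughly along these lines; the map file name and options here are
>>>> illustrative:
>>>>
>>>>     # /etc/auto.master
>>>>     /l    /etc/auto.l
>>>>
>>>>     # /etc/auto.l -- '&' expands to the key, i.e. the remote hostname
>>>>     *    -fstype=nfs,vers=3    &:/local
>>>>
>>>> so /l/nodename resolves to nodename:/local.)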
>>>>
>>>> Starting about two weeks ago, when jobs would try to access
>>>> /l/hostname, they would get permission denied messages. I tried
>>>> analyzing this problem by turning on all NFS/RPC logging with
>>>> rpcdebug and also using tcpdump while trying to manually mount one of
>>>> the remote systems. Both approaches indicated stale file handles were
>>>> preventing the share from being mounted.
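>>>>
>>>> For the record, the debugging steps were roughly as follows (interface
>>>> and server names are placeholders):
>>>>
>>>>     # enable verbose NFS/RPC logging (output goes to the kernel log)
>>>>     rpcdebug -m nfs -s all
>>>>     rpcdebug -m rpc -s all
>>>>
>>>>     # capture the mount attempt on the wire (NFS on 2049, portmapper
>>>>     # on 111; mountd may sit on another port)
>>>>     tcpdump -i eth0 -s 0 -w mount.pcap host nfsserver and '(port 2049 or port 111)'
>>>>
>>>>     # turn the extra logging back off afterwards
>>>>     rpcdebug -m nfs -c all
>>>>     rpcdebug -m rpc -c all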
>>>>
>>>> It has been 6-8 weeks since the last seemingly relevant system config
>>>> change, so I suspect it's an application problem
>>>> (naturally). On the other hand, the application developers/users
>>>> insist that they haven't made any changes to their code, either. To
>>>> be honest, there's no significant evidence indicating either is at
>>>> fault. Any suggestions on how to debug this and definitively find the
>>>> root cause of these stale file handles?
>>>>
>>>> -- 
>>>> Prentice
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>


