[Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal pbisbal at pppl.gov
Thu Apr 20 14:18:40 PDT 2017


Thanks for the tip. I hadn't even thought of looking at SMART, although 
any errors should show up in the logwatch e-mails, which I do check 
every day, and haven't seen any on these systems. I also heard recently 
that the smartmontools that come with most Linux distros are horribly 
old, and the latest version can find a lot more SMART errors than the 
old distro-provided versions.
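
For anyone who wants to check for the attribute Neil mentions rather than wait for logwatch, here is a rough Python sketch. It assumes the usual ten-column attribute table that `smartctl -A` prints, and the helper names (parse_smart_attributes, command_timeout_raw, read_attributes) are made up for illustration. Not every drive reports attribute 188, and what counts as a "high" value is a judgment call I'm not making here.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` into {id: (name, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            # Raw values can contain spaces, so keep everything after column 9.
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def command_timeout_raw(text):
    """Return the raw value string of attribute 188, or None if it is not reported."""
    entry = parse_smart_attributes(text).get(188)
    return entry[1] if entry else None


def read_attributes(device):
    """Run `smartctl -A` on a device (needs root and smartmontools installed)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout
```

Something like command_timeout_raw(read_attributes("/dev/sda")) would then give the raw value for one disk; the device name is only an example, and disks behind a hardware RAID controller usually need extra smartctl options that I have left out.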

Prentice

On 04/19/2017 07:11 PM, Neil McFadyen wrote:
> I had a similar problem and it turned out to be a disk problem.  SMART 
> attributes showed high
> 188 Command_Timeout values for 1 of the disks in the RAID array on the 
> storage server.
> The server would become inaccessible (I couldn't even ping it), with 
> no errors in the server's logs. I had to reboot the server, and then 
> it would work for a while before failing again. Changing the disk 
> fixed it.
>
> Neil McFadyen
> Carleton University
>
> On 2017-04-19 1:58 PM, Prentice Bisbal wrote:
>> Here's the sequence of events:
>>
>> 1. First job(s) run fine on the node and complete without error.
>>
>> 2. Eventually a job fails with a 'permission denied' error when it 
>> tries to access /l/hostname.
>>
>> Since no jobs fail with a file I/O error, it's hard to confirm that 
>> the jobs themselves are causing the problem. However, if these 
>> particular jobs are the only thing running on the cluster and should 
>> be the only jobs accessing these NFS shares, what else could be 
>> causing the errors?
>>
>> All these systems are getting their user information from LDAP. Since 
>> some jobs run before these errors appear, missing or inaccurate user 
>> info doesn't seem to be a likely source of this problem, but I'm not 
>> ruling anything out at this point.
>>
>> Important detail: This is NFSv3.
>>
>> Prentice Bisbal
>> Lead Software Engineer
>> Princeton Plasma Physics Laboratory
>> http://www.pppl.gov
>>
>> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
>>> Are you saying they can’t mount the filesystem, or they can’t write 
>>> to a mounted filesystem? Where does this system get its user 
>>> information from, if the latter?
>>>
>>> -- 
>>> ____
>>> || \\UTGERS, |---------------------------*O*---------------------------
>>> ||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
>>>       `'
>>>
>>>> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>>>
>>>> Beowulfers,
>>>>
>>>> I've been trying to troubleshoot a problem for the past two weeks 
>>>> with no luck. We have a cluster here that runs only one application 
>>>> (although the details of that application change significantly from 
>>>> run to run). Each node in the cluster has an NFS export, /local, 
>>>> that can be automounted by every other node in the cluster as 
>>>> /l/hostname.
>>>>
>>>> Starting about two weeks ago, when jobs would try to access 
>>>> /l/hostname, they would get permission denied messages. I tried 
>>>> analyzing this problem by turning on all NFS/RPC logging with 
>>>> rpcdebug and also using tcpdump while trying to manually mount one 
>>>> of the remote systems. Both approaches indicated that stale file 
>>>> handles were preventing the share from being mounted.
>>>>
>>>> Since it has been 6-8 weeks since there were any seemingly relevant 
>>>> system config changes, I suspect it's an application problem 
>>>> (naturally). On the other hand, the application developers/users 
>>>> insist that they haven't made any changes to their code, either. 
>>>> To be honest, there's no significant evidence indicating either is 
>>>> at fault. Any suggestions on how to debug this and definitively 
>>>> find the root cause of these stale file handles?
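
One cheap way to narrow this down is to poll the automounted paths between jobs and record which error each one returns, so a stale handle (ESTALE) can be told apart from a genuine permission problem (EACCES). A minimal Python sketch, assuming the /l/hostname layout described above; probe and probe_hosts are hypothetical helper names, and the stat call can block for a long time if the remote server is hung, so treat this as a starting point only.

```python
import errno
import os


def probe(path):
    """Classify one path as 'ok', 'stale', 'denied', 'missing', or 'error:<NAME>'."""
    try:
        # On an automounted path, stat also triggers the mount attempt.
        os.stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "denied"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Fall back to the symbolic errno name when one is known.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def probe_hosts(hostnames, root="/l"):
    """Probe root/<hostname> for every host and return {hostname: status}."""
    return {host: probe(os.path.join(root, host)) for host in hostnames}
```

Running probe_hosts over the node list from cron or a job prolog, and logging the result with a timestamp, would show when a node first turns stale relative to the jobs that ran on it.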
>>>>
>>>> -- 
>>>> Prentice
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
>>>> Computing
>>>> To change your subscription (digest mode or unsubscribe) visit 
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>


