[Beowulf] Troubleshooting NFS stale file handles

Wed Apr 19 11:34:43 PDT 2017

On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote:
> On 04/19/2017 02:11 PM, Prentice Bisbal wrote:
>> Thanks for the suggestion(s). Just this morning I started considering
>> the network as a possible source of error. My stale file handle errors
>> are easily fixed by just restarting the NFS servers with 'service nfs
>> restart', so they aren't as severe as you describe.
>
> If a restart solely on the /server side/ gets you back into a good 
> state, that is an interesting tidbit.
That is correct; restarting NFS on the server side is all it takes to 
fix the problem.
> Do you have some form of HA setup for NFS?  Automatic failover 
> (sometimes set up with IP aliasing) in the face of network hiccups can 
> occasionally goof the clients if they aren't set up properly to keep up 
> with the change.  A restart of the server will likely revert to 
> using the primary, resulting in the clients thinking everything is 
> back up and healthy again.  This situation varies so much between 
> vendors it's hard to say much more without more details on your setup.
>
My setup isn't nearly that complicated. Every node in this cluster has a 
/local directory that is shared out to the other nodes in the cluster. 
The other nodes automount this remote directory as /l/hostname, where 
"hostname" is the name of the owner of the filesystem. For example, hostB 
will mount hostA:/local as /l/hostA.

No fancy fail-over or anything like that.
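
For illustration only, the naming scheme described above can be sketched in
Python. This is not the actual cluster configuration: the function names are
hypothetical, and the assumption that the automounter is autofs with an
indirect map (and the "-fstype=nfs" option shown) is not stated anywhere in
this thread. The sketch only shows how each owner's /local export maps to a
/l/<hostname> mount point on the other nodes.

```python
def nfs_source(owner, export="/local"):
    """Return the NFS source string for a node's exported directory."""
    return f"{owner}:{export}"


def automount_path(owner, base="/l"):
    """Return the path where other nodes see the owner's export."""
    return f"{base}/{owner}"


def autofs_map_lines(hosts, export="/local"):
    """Build one indirect-map entry per host.

    Assumption: an autofs-style map of the form 'key -options location'.
    The real map used on this cluster is not shown in the thread.
    """
    return [f"{h} -fstype=nfs {nfs_source(h, export)}" for h in hosts]


if __name__ == "__main__":
    # hostB would see hostA's /local under /l/hostA.
    print(nfs_source("hostA"), "->", automount_path("hostA"))
    for line in autofs_map_lines(["hostA", "hostB"]):
        print(line)
```

With a layout like this, every node is both an NFS server and a client of
every other node, so a stale handle on /l/hostA points at hostA's NFS
service rather than at any shared or failover server.
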
> Best,
>
> ellis
>
> P.S., apologies for the top-post last time around.
>
No worries. I'm so used to people doing that on mailing lists that I've 
become numb to it.

Prentice