[Beowulf] Troubleshooting NFS stale file handles

Jörg Saßmannshausen j.sassmannshausen at ucl.ac.uk
Wed Apr 19 12:21:00 PDT 2017


Hi Prentice,

three questions (not necessarily for you, and they can be dealt with in a 
different thread too):

- why automount and not a static mount? (see the sketch below)
- do I get that right that the nodes themselves export shares to other nodes?
- has anything changed? I am thinking of things like more nodes being added, new 
programs being installed, more users being added, generally a higher load on the 
cluster.
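
Just to illustrate what I mean with the first question (the hostnames and 
paths here are only examples, not your actual setup): a static mount would be 
one line per server in /etc/fstab on each client, e.g.

hostA:/local   /l/hostA   nfs   defaults   0 0

whereas with autofs you would typically have an indirect map, something like

# /etc/auto.master
/l   /etc/auto.l

# /etc/auto.l
*    -fstype=nfs,hard   &:/local

so that /l/hostA only gets mounted on demand and is unmounted again when idle.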

One problem I had in the past with my 112-node cluster, where I am exporting 
/home, /opt and one directory in /usr/local from the headnode to all the nodes, 
was that the NFS server on the headnode did not have enough nfsd server threads 
assigned and thus was running out of capacity. That also led to strange 
behaviour, which I fixed by increasing the number of server threads. 
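
In case it is useful: you can check how many nfsd threads are actually running 
on the NFS server (assuming a Linux kernel NFS server) with something like

ps ax | grep '\[nfsd\]' | wc -l

or by looking at the first number on the "th" line in /proc/net/rpc/nfsd:

grep ^th /proc/net/rpc/nfsd

On older kernels the remaining numbers on that line gave an indication of how 
often all threads were busy at the same time, but I would not rely on them on 
newer kernels.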

The way I have done that was to set this in 
/etc/default/nfs-kernel-server:

# Number of servers to start up
RPCNFSDCOUNT=32
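
If I remember correctly the new value only takes effect once the NFS server 
has been restarted, i.e. something like this on Debian:

service nfs-kernel-server restart

It should also be possible to change the number of threads on the fly with 
rpc.nfsd, e.g. "rpc.nfsd 32", but I have only ever set it via the config file.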

That seems to provide the right number of server threads, including spare ones, 
for me. Like in your case, the cluster was running stably until I added more 
nodes *and* users decided to use them, i.e. the load on the cluster went up. A 
mostly idle cluster did not show any problems; a cluster under 80% load suddenly 
had problems. 

I hope that helps a bit. I am not an NFS expert either and this is just 
my experience. For reference, I am using Debian nfs-kernel-server 1:1.2.6-4, 
if that helps. 

All the best from a sunny London

Jörg

On Wednesday 19 April 2017 Prentice Bisbal wrote:
> On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote:
> > On 04/19/2017 02:11 PM, Prentice Bisbal wrote:
> >> Thanks for the suggestion(s). Just this morning I started considering
> >> the network as a possible source of error. My stale file handle errors
> >> are easily fixed by just restarting the nfs servers with 'service nfs
> >> restart', so they aren't as severe as you describe.
> > 
> > If a restart on solely the /server-side/ gets you back into a good
> > state this is an interesting tidbit.
> 
> That is correct, restarting NFS on the server-side is all it takes to
> fix the problem.
> 
> > Do you have some form of HA setup for NFS?  Automatic failover
> > (sometimes setup with IP aliasing) in the face of network hiccups can
> > occasionally goof the clients if they aren't set up properly to keep up
> > with the change.  A restart of the server will likely revert back to
> > using the primary, resulting in the clients thinking everything is
> > back up and healthy again.  This situation varies so much between
> > vendors it's hard to say much more without more details on your setup.
> 
> My setup isn't nearly that complicated. Every node in this cluster has a
> /local directory that is shared out to the other nodes in the cluster.
> The other nodes automount this remote directory as /l/hostname, where
> "hostname" is the name of the owner of the filesystem. For example, hostB
> will mount hostA:/local as /l/hostA.
> 
> No fancy fail-over or anything like that.
> 
> > Best,
> > 
> > ellis
> > 
> > P.S., apologies for the top-post last time around.
> 
> No worries. I'm so used to people doing that in mailing lists that I've
> become numb to it.
> 
> Prentice
> 


-- 
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
20 Gordon Street
London
WC1H 0AJ 

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html