[Beowulf] Troubleshooting NFS stale file handles
j.sassmannshausen at ucl.ac.uk
Wed Apr 19 12:21:00 PDT 2017
three questions (not necessarily to you, and they can be dealt with separately):
- why automount and not a static mount?
- do I get that right that the nodes themselves export shares to other nodes?
- has anything changed? I am thinking of something like more nodes added, new
programs being installed, more users added, generally a higher load on the
cluster?
One problem I had in the past with my 112 node cluster where I am exporting
/home, /opt and one directory in /usr/local to all the nodes from the headnode
was that the NFS-server on the headnode did not have enough spare servers
assigned and thus was running out of capacity. That also led to strange
behaviour, which I fixed by increasing the number of spare servers.
The way I did that was by raising this setting in the NFS server's
configuration file:
# Number of servers to start up
That seems to provide the right number of servers and spare ones for me.
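For reference, on Debian the nfs-kernel-server package reads that setting from /etc/default/nfs-kernel-server. The value below (64) is a hypothetical example, not my actual setting; pick a count that matches your client load:

```shell
# /etc/default/nfs-kernel-server (Debian) -- hypothetical value shown
# Number of servers to start up
RPCNFSDCOUNT=64

# apply the change, then confirm how many nfsd threads are running
sudo service nfs-kernel-server restart
cat /proc/fs/nfsd/threads
```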
Like in your case, the cluster ran stably until I added more nodes
*and* users decided to use them, i.e. the load on the cluster went up. A mostly
idle cluster did not show any problems; a cluster under 80 % load suddenly had
them.
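One way to check whether your server is being starved for nfsd threads is the "th" line the kernel exposes (assuming your kernel provides these statistics; on newer kernels the trailing histogram fields are zeroed but the counters remain):

```shell
# First number after "th" is the thread count; the second is how many
# times *all* threads were busy at once. A large second number suggests
# the server needs more nfsd threads.
grep '^th' /proc/net/rpc/nfsd
```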
I hope that helps a bit. I am not an NFS expert either; this is just
my experience. I am also using Debian nfs-kernel-server 1:1.2.6-4, if that
matters.
All the best from a sunny London
On Wednesday, 19 April 2017, Prentice Bisbal wrote:
> On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote:
> > On 04/19/2017 02:11 PM, Prentice Bisbal wrote:
> >> Thanks for the suggestion(s). Just this morning I started considering
> >> the network as a possible source of error. My stale file handle errors
> >> are easily fixed by just restarting the nfs servers with 'service nfs
> >> restart', so they aren't as severe as you describe.
> > If a restart on solely the /server-side/ gets you back into a good
> > state this is an interesting tidbit.
> That is correct, restarting NFS on the server-side is all it takes to
> fix the problem.
> > Do you have some form of HA setup for NFS? Automatic failover
> > (sometimes setup with IP aliasing) in the face of network hiccups can
> > occasionally goof the clients if they aren't set up properly to keep up
> > with the change. A restart of the server will likely revert back to
> > using the primary, resulting in the clients thinking everything is
> > back up and healthy again. This situation varies so much between
> > vendors it's hard to say much more without more details on your setup.
> My setup isn't nearly that complicated. Every node in this cluster has a
> /local directory that is shared out to the other nodes in the cluster.
> The other nodes automount this remote directory as /l/hostname, where
> "hostname" is the name of the owner of the filesystem. For example, hostB
> will mount hostA:/local as /l/hostA.
> No fancy fail-over or anything like that.
> > Best,
> > ellis
> > P.S., apologies for the top-post last time around.
> No worries. I'm so used to people doing that in mailing lists that I've
> become numb to it.
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
20 Gordon Street
email: j.sassmannshausen at ucl.ac.uk
Please avoid sending me Word or PowerPoint attachments.