[Beowulf] Transient NFS Problems in New Cluster

Henning Fehrmann henning.fehrmann at aei.mpg.de
Tue Feb 2 23:28:45 PST 2010


On Tue, Feb 02, 2010 at 02:00:37PM -0800, Jon Forrest wrote:
> I have a new cluster running CentOS 5.3.
> The cluster uses a Sun 7310 storage server
> that provides NFS service over a private
> 1Gb/s ethernet with 9K jumbo frames to the
> cluster.
> 
> We've noticed that a number of the compute
> nodes sometimes generate the
> 
> automount[15023]: umount_autofs_indirect: ask umount returned busy /home
> 
> message. When this happens the program running on the
> node dies. This has happened between 10 and 20 times.
> We're not sure what's going on on a node when this
> happens. Most of the time everything is fine and
> the home directories are automounted without problem.
> 
> I've googled for this problem and I see that other people
> have seen it too, but I've never seen a resolution,
> especially not for RHEL5.

I suspect the problem is not directly related to RHEL5.

You might want to post this question to 
autofs at linux.kernel.org

They need to know the version of autofs and the kernel.
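
Something like the following should collect both on a compute node. This is
only a sketch: it assumes an rpm-based system such as CentOS, and the variable
names are just for illustration.

```shell
# Kernel release running on this node
kernel_version="$(uname -r)"

# autofs package version; assumes rpm is the package manager (CentOS/RHEL).
# Falls back to a placeholder if rpm is missing or the package is not installed.
autofs_version="$(rpm -q autofs 2>/dev/null)" || autofs_version="unknown (rpm query failed)"

echo "kernel: ${kernel_version}"
echo "autofs: ${autofs_version}"
```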

> 
> The auto.master line for this mount is
> 
> /home  /etc/auto.home  --timeout=1200

You could try reducing the timeout. There is no reason not to use a timeout
of 60s. A lot can happen in 1200s, especially on the server side.
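
For example, the auto.master line quoted above would become the following
(60 is only the suggested value; everything else is unchanged):

```
/home  /etc/auto.home  --timeout=60
```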

> noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768

You could try nolock on the client side and async on the
server side. Users should then make sure that no two processes
write to the same file, to avoid race conditions.
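
On the client, that would mean appending nolock to the option string quoted
above, with the rest left as it is:

```
noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768,nolock
```

On a Linux NFS server, async is an export option in /etc/exports. The Sun 7310
has its own administration interface, so the equivalent setting there would
need to be checked in its documentation.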

Cheers,
Henning


