[Beowulf] NFS HPC survey results.
Fred_Liu at issi.com
Thu Jul 21 02:22:52 PDT 2016
Thank you so much!
> -----Original Message-----
> From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Bill
> Sent: Thursday, July 21, 2016 7:19
> To: Beowulf at beowulf.org
> Subject: [Beowulf] NFS HPC survey results.
> Many thanks for all the responses.
> Here's the promised raw data:
> I'll summarize the 26 results below. I'll email a similar summary to those who asked.
> Not everyone answered all questions.
> 1) cluster OS:
> 72% Redhat/CentOS/Scientific linux or derivative
> 24% Debian/Ubuntu or derivative
> 4% SUSE or derivative
> 2) Appliance/NAS or linux server
> 32% NFS appliance
> 76% linux server
> 12% other (illumos/Solaris)
> 3) Appliances used (one each, free form answers):
> * Hitachi BlueARC, EMC Isilon, DDN/GPFS, x4540
> * Not sure - something that corporate provided. An F5, maybe...? Also a
> Panasas system for /scratch.
> * NetApp FAS6xxx
> * netapp
> * isilon x and nl
> * Isilon
> * NetApp
> * Synology
> 4) Which kernel do you use:
> 88% one provided with the linux distribution
> 12% one that I compile/tweak myself
> 5) what kernel changes do you make
> * CPU performance tweaking, network performance.
> * raise ARP cache size; a newer kernel than the stock 3.2 was needed for newer
> hardware (3.14 at the moment)
> * ZFS
> 6) Do you often see problems like nfs: server 192.168.5.30 not responding,
> timed out:
> 42.3% Never
> 23.1% Sometimes
> 19.2% rarely
> 7.7% daily
> 7.7% often
> 7) If you see NFS time outs what do you do (free form answers)
> * nothing
> * nothing
> * Restart NFSd, look for performance intensive jobs, sometimes increase
> * Look at what's going on on that server. That means looking at what the
> disks are doing, what network flows are going to/from that server and
> determine if the load is something to take action on or to let.
> * Not much
> * Reboot
> * Resolve connectivity issue if any and run mount command on nodes. If this
> doesn't fix it, then reboot.
> * Ignore them, unless they become a problem.
> * Look for the root cause of the issue, typically system is suffering network
> issues or is overloaded by a user 'abuse/misuse'.
> * diagnose and identify underlying cause
> * Try to figure out who is overloading the NFS server (hard job)
> * Troubleshoot, typically a machine is offline or network saturation
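Several answers above come down to working out which server is timing out and how often. A minimal sketch of that first step, assuming the client kernel log contains the message quoted in question 6; the sample log lines and the function name are hypothetical and only there for illustration:

```python
import re
from collections import Counter

# Matches the client-side kernel message quoted in question 6.
NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding")


def count_timeouts(log_lines):
    """Return a Counter mapping NFS server -> number of 'not responding' lines."""
    counts = Counter()
    for line in log_lines:
        match = NOT_RESPONDING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt, not taken from any respondent's system.
SAMPLE_LOG = [
    "kernel: nfs: server 192.168.5.30 not responding, timed out",
    "kernel: nfs: server 192.168.5.30 not responding, still trying",
    "kernel: nfs: server 192.168.5.31 not responding, timed out",
    "kernel: nfs: server 192.168.5.30 OK",
]

if __name__ == "__main__":
    for server, hits in count_timeouts(SAMPLE_LOG).most_common():
        print(server, hits)
```

Feeding it the real kernel log instead of SAMPLE_LOG gives a per-server count, which may help decide whether a problem is one overloaded server or a wider network issue.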
> 8) which NFS options do you use (free form):
> * tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10
> * nfsvers=3,nolock,hard,intr,timeo=16,retrans=8
> * hard,intr,rsize=32768,wsize=32768
> * all default
> * async
> * async,nodev,nosuid,rsize=32768,wsize=32768
> * tcp,async, nodev, nosuid,timeout=10
> * -rw,intr,nosuid,proto=tcp (mostly. Could be "ro" and/or "suid")
> * rsize=32768,wsize=32768
> * -nobrowse,intr,rsize=32768,wsize=32768,vers=3
> * udp,hard,timeo=50,retrans=7,intr,bg,rsize=8192,wsize=8192,nfsvers=3,
> * RHEL defaults
> * default ones, they're almost always the best ones
> * rw,nosuid,nodev,tcp,hard,intr,vers=4
> * defaults, netdev,vers=3
> * nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2
> * rw,hard,tcp,nfsvers=3,noacl,nolock
> * default rhel6 (+nosuid, nodev, and sometimes nfsver=3)
> * tcp, intr, noauto, timeout, rsize, wsize, auto
> * nfsvers=3,rsize=1024,wsize=1024,cto
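The free-form answers to question 8 are easier to compare once they are split into individual options. A rough sketch, using five of the answers copied from the list above; the helper names are made up for this example and the parsing is deliberately naive:

```python
from collections import Counter

# Five of the free-form answers to question 8, copied from the list above.
ANSWERS = [
    "tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10",
    "nfsvers=3,nolock,hard,intr,timeo=16,retrans=8",
    "hard,intr,rsize=32768,wsize=32768",
    "rw,nosuid,nodev,tcp,hard,intr,vers=4",
    "nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2",
]


def parse_options(answer):
    """Split a comma-separated mount option string into {name: value or None}."""
    options = {}
    for item in answer.lstrip("-").split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        options[name] = value or None
    return options


def tally(answers):
    """Count how many answers mention each option name."""
    counts = Counter()
    for answer in answers:
        counts.update(parse_options(answer).keys())
    return counts


if __name__ == "__main__":
    for name, hits in tally(ANSWERS).most_common():
        print(name, hits)
```

The tally only reflects what respondents wrote down; sites answering "defaults" contribute nothing, so it understates how common options such as tcp and hard really are.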
> 9) Any explanations:
> * We have not yet made the change to nfsv4, we use nolock due to various
> application "issues", we do not hard set rsize/wsize as they have been
> negotiating better values for a number of years on their own under v3,
> and the timeout/retrans are a bit of a legacy set of values from working
> this issue of server overload. Hard was a choice on our end to pick that
> having things hang definitely seemed better than having things fail and
> stale. We still agree with the choice of hard. Intr just helps to
> "interrupt" stuck things when needed.
> * We like to be able to ctrl-C hung processes. For some systems we use
> rsize/wsize if the vendor supports it.
> * works for me without tweaks
> * We didn't use tcp until the last couple of years.
> * Probably needs a revisit- block size was set up for 2.x series kernels
> * default of centos 7
> * nfsv4 was not stable enough last time out, don't fix rsize/wsize as
> client/server usually negotiate to 1M anyway
> * We have frequent power outages (5+ times a year) and noauto helps our
> not to hang on mounting nfs shares. Drawback is you have to manually mount.
> out helps with this issue as well.
> * These are adjusted if necessary for particular workloads
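Two of the explanations above mention that client and server negotiate rsize/wsize on their own. One way to see what was actually negotiated is to read the options the kernel reports for each NFS mount. A small sketch, assuming the usual /proc/mounts layout (device, mount point, type, options); the sample line and the function name are hypothetical:

```python
def nfs_transfer_sizes(mounts_text):
    """Map each NFS mount point to its (rsize, wsize) as reported by the kernel."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            name, _, value = item.partition("=")
            options[name] = value
        rsize = int(options["rsize"]) if "rsize" in options else None
        wsize = int(options["wsize"]) if "wsize" in options else None
        sizes[fields[1]] = (rsize, wsize)
    return sizes


# Hypothetical /proc/mounts content for illustration only.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "server:/export/home /home nfs4 "
    "rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp 0 0\n"
)

if __name__ == "__main__":
    # On a real client, pass open("/proc/mounts").read() instead.
    print(nfs_transfer_sizes(SAMPLE_MOUNTS))
```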
> 10) what parts of the file system do you use NFS for (free form):
> * /home
> * /home
> * /home
> * /home
> * /home
> * /home
> * /home
> * /home and /apps
> * We use NFS for the OS (NFSRoot), App tree, $HOME, Group dedicated space, as
> well as some of our scratch spaces. All of these come from different NFS
> * /home, /apps
> * /home /opt /etc /usr /boot
> * /home,/apps,
> * /home, /apps, /scratch - all of 'em
> * /home, long term project storage, shared software
> * /cluster/home,/cluster/local,/cluster/scratch,/cluster/data
> * home, apps, shared data
> * /usr/local, /home
> * /home , /apps
> * various
> * /home, /group, /usr/local
> * /home, parts of /opt, some specific top level auto-mountable dirs
> * What above is called /apps and /home for a few medium sized systems
> * /home, /local, /opt, /diskless
> * /home, /opt, diskless node images
> 11) How many nodes can mount a single NFS server at once:
> 24% >= 512 nodes
> 20% 65-128 nodes
> 16% 1-16 nodes
> 12% 17-32 nodes
> 12% 257-512 nodes
> 12% 129-256 nodes
> 4% 33-64 nodes
> 12) How many NFSd daemons do you run per NFS server
> 45.0% 1-16
> 13.6% 129-256
> 13.6% 65-128
> 9.1% 33-64
> 4.5% 17-32
> 4.5% 256-512
> 4.5% 512-1024
> 4.5% 2048-4096
> 13) Do you use NFSd or user space
> 81.0% Kernel NFSd
> 14.3% User space
> 4.8% Both
> 14) What interconnect do you use with NFS?
> 38.5% 10G
> 26.9% GigE
> 23.1% IB
> 11.5% Other
> 15) If IB what transport (10 responses)
> 100% IPoIB
> 0% Other
> 16) If IB, do you use connected mode (8 responses)
> 62.5% Connected mode
> 37.5% Don't use connected mode
> 17) Do you use UDP or TCP (25 responses)
> 84% TCP
> 12% UDP
> 4% Other
> 18) Which other network file systems do you use? (24 responses)
> 0% PNFS
> 58.3% Lustre
> 16.7% Ceph
> 12.5% BeeGFS
> 12.5% GlusterFS
> 8.3% None (Panasas, GPFS, HSM/SAM/QFS, or more than one of the
> 19) Are the other network file systems more or less reliable than NFS?
> 58.3% Similar
> 16.7% I use only NFS
> 12.5% Much more reliable
> 4.2% Much less reliable
> 4.2% Somewhat less reliable
> 4.2% Somewhat more reliable
> 20) Do you support MPI-IO (not just MPI)
> 70.8% no
> 20.8% yes
> 8.3% (yes, but nobody uses it)
> 21) Any tips for making NFS perform better or more reliably?
> * We start with the underlying block (raid/disks) setup that you are going to
> serve data out and plumb up from there. The key thing here is choosing your
> raid stride/chunk sizes and ensuring your file system is as aware of the raid
> layout for good alignment as you can. We do follow the esnet host tuning found
> at: http://fasterdata.es.net/host-tuning/linux/ on both client and server
> systems. We also bump up the rpc.mountd count to help ensure successful
> mounts as we use autofs to mount a number of the nfs spaces. When a larger
> HPC job starts up on many nodes we did have a time where not all would be
> able to mount successfully if the server was under load. Increasing the
> rpc.mountd count helped. We also set async and wdelay on our exports on the
> * Kernel settings
> * I've heard that configuring IB in RDMA boosts NFS performance
> * We don't use NFS for high performance cluster data. That's Lustre's
> Where NFS is used for scientific data, it's in places where there are modest
> numbers of concurrent clients.
> * more disks
> * RPCMOUNTDOPTS="--num-threads=64"
> * Try to optimize /etc/sysconfig/nfs as much as possible.
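The first tip in question 21 is about making the file system aware of the RAID layout. As a worked example, assuming an ext4 file system (the respondent did not say which one they use), the two alignment hints are derived from the chunk size and the number of data-bearing disks. The function name and the example geometry are hypothetical:

```python
def ext4_raid_hints(chunk_kib, data_disks, block_kib=4):
    """Return (stride, stripe_width) in file system blocks.

    stride       = RAID chunk size / file system block size
    stripe_width = stride * number of data-bearing disks (parity excluded)
    """
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_kib // block_kib
    return stride, stride * data_disks


if __name__ == "__main__":
    # Hypothetical geometry: 6-disk RAID6 (4 data disks), 512 KiB chunks,
    # 4 KiB file system blocks.
    stride, stripe_width = ext4_raid_hints(chunk_kib=512, data_disks=4)
    print(f"mke2fs -t ext4 -E stride={stride},stripe_width={stripe_width} ...")
```

Other file systems take the same information in different units, so the numbers above should not be copied to them directly.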
> 22) Any tips for making NFS clients perform better or more reliably?
> * Following the above mentioned esnet info at:
> http://fasterdata.es.net/host-tuning/linux/. I should note that for both client
> and server that are using IPoIB we use connected mode and set the MTU to
> * Reducing the size of the kernel dirty buffer on the clients makes
> performance much more consistent.
> * use reliable interconnect hw
> * We've tried scripting NFS mounts w/o much success.
> * Educate users on using the right filesystem for the right task
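One answer above recommends reducing the kernel dirty buffer on clients. The respondent did not say which setting they change; the sketch below assumes it is the Linux vm.dirty_ratio / vm.dirty_bytes pair and only estimates how much dirty data a client may accumulate before writers are throttled. It is an approximation: the kernel applies the ratio to dirtyable memory rather than to total RAM, and the function name is made up for this example.

```python
def dirty_threshold_bytes(dirtyable_bytes, dirty_ratio=20, dirty_bytes=0):
    """Approximate the amount of dirty page cache allowed before throttling.

    A non-zero dirty_bytes takes precedence over dirty_ratio, matching how
    the two sysctls are documented to interact. The default ratio of 20 is
    the mainline kernel default; distributions may ship something else.
    """
    if dirty_bytes > 0:
        return dirty_bytes
    return dirtyable_bytes * dirty_ratio // 100


if __name__ == "__main__":
    gib = 2 ** 30
    # Hypothetical client with 64 GiB of memory and the default ratio.
    print(dirty_threshold_bytes(64 * gib) / gib, "GiB with dirty_ratio=20")
    # Same client with an explicit 256 MiB byte limit (assumed value).
    print(dirty_threshold_bytes(64 * gib, dirty_bytes=256 * 2 ** 20) / gib, "GiB")
```

On a large-memory node the ratio-based default can allow many gigabytes of unwritten data to build up, which is one plausible reason a smaller limit makes NFS write performance more consistent.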
> 23) Anything you would like to add:
> * We have also seen input from others that they see gains with the client
> option of 'nocto'. The man pages would suggest this has some risks so while we
> have tested and can see that certain loads see a gain from this we have not yet
> moved forward to deploy this option on our general setup. We are in process of
> testing our apps to ensure we do not create other issues for apps if we do use
> this flag. Another thing we have been looking at is cachefilesd and seeing
> how well that helps for data that can easily be cached. For things like our
> application trees, the OS (we are NFSRoot booted), and even some user
> reference data sets this looks quite promising but we have not gone live with
> this yet either.
> * We're always looking to improve our environment as well. We don't
> always have TIME to do so, of course.
> * Horses for courses. NFS is great for shared software and home
> It's pretty useless for high performance access from hundreds of compute
> * Every storage system / file system I've ever seen or used has had its
> problems. There is no silver bullet (afaik). Use that which you have the
> competence to handle.
> * We are currently struggling with NFS mounts. We use them extensively
> throughout our department. Problems are they hang constantly and when one
> person is using the share heavily it slows down other computers. We've done
> lots of research into optimizing NFS but always come back to the same issues
> (hanging mounts that don't recover w/o admin interaction). We would love to
> know what other people are doing. We are experimenting with ceph at the
> moment for future large storage needs.