[Beowulf] NFS HPC survey results.

Thu Jul 21 02:22:52 PDT 2016

Very informative!

Thank you so much!

Fred

> -----Original Message-----
> From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Bill
> Broadley
> Sent: Thursday, July 21, 2016 7:19
> To: Beowulf at beowulf.org
> Subject: [Beowulf] NFS HPC survey results.
> 
> 
> Many thanks for all the responses.
> 
> Here's the promised raw data:
>     https://wiki.cse.ucdavis.edu/_media/wiki:linux-hpc-nfs-survey.csv
> 
> I'll summarize the 26 results below.  I'll email similar to those that asked.
> 
> Not everyone answered all questions.
> 
> 1) cluster OS:
>    72% Redhat/CentOS/Scientific linux or derivative
>    24% Debian/Ubuntu or derivative
>     4% SUSE or derivative
> 
> 2) Appliance/NAS or linux server
>     32% NFS appliance
>     76% linux server
>     12% other (illumos/Solaris)
> 
> 3) Appliances used (one each, free form answers):
>     * Hitachi BlueARC, EMC Isilon, DDN/GPFS, x4540
>     * Not sure - something that corporate provided. An F5, maybe...? Also a
>         Panasas system for /scratch.
>     * NetApp FAS6xxx
>     * netapp
>     * isilon x and nl
>     * Isilon
>     * NetApp
>     * Synology
> 
> 4) Which kernel do you use:
>     88% one provided with the linux distribution
>     12% one that I compile/tweak myself
> 
> 5) what kernel changes do you make
>     * CPU performance tweaking, network performance.
>     * raise ARP cache size, newer kernel than stock 3.2 was needed for newer
>       hardware 3.14 at the moment
>     * ZFS
> 
> 6) Do you often see problems like nfs: server 192.168.5.30 not responding,
>     timed out:
>     42.3% Never
>     23.1% Sometimes
>     19.2% rarely
>      7.7% daily
>      7.7% often
> 
> 7) If you see NFS time outs what do you do (free form answers)
>    * nothing
>    * nothing
>    * Restart NFSd, look for performance intensive jobs, sometimes increase
>      NFSd.
>    * Look at what's going on on that server. That means looking at what the
>      disks are doing, what network flows are going to/from that server and
>      determine if the load is something to take action on or to let.
>    * Not much
>    * Reboot
>    * Resolve connectivity issue if any and run mount command on nodes. If this
>      doesn't fix it, then reboot.
>    * Ignore them, unless they become a problem.
>    * Look for the root cause of the issue, typically system is suffering network
>      issues or is overloaded by a user 'abuse/misuse'.
>    * diagnose and identify underlying cause
>    * Try to figure out who is overloading the NFS server (hard job)
>    * Troubleshoot, typically a machine is offline or network saturation
> 
> 8) which NFS options do you use (free form):
>    * tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10
>    * nfsvers=3,nolock,hard,intr,timeo=16,retrans=8
>    * hard,intr,rsize=32768,wsize=32768
>    * all default
>    * async
>    * async,nodev,nosuid,rsize=32768,wsize=32768
>    * tcp,async, nodev, nosuid,timeout=10
>    * -rw,intr,nosuid,proto=tcp (mostly. Could be "ro" and/or "suid")
>    * rsize=32768,wsize=32768,hard,intr,vers=3,proto=tcp,retrans=2,timeo=600
>    * rsize=32768,wsize=32768
>    * -nobrowse,intr,rsize=32768,wsize=32768,vers=3
>    * udp,hard,timeo=50,retrans=7,intr,bg,rsize=8192,wsize=8192,nfsvers=3,
>      mountvers=3
>    * RHEL defaults
>    * default ones, they're almost always the best ones
>    * rw,nosuid,nodev,tcp,hard,intr,vers=4
>    * rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
>      port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.5.6.7,local_lock=none,
>      addr=10.5.6.1
>    * defaults, netdev,vers=3
>    * nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2
>    * rw,hard,tcp,nfsvers=3,noacl,nolock
>    * default rhel6 (+nosuid, nodev, and sometimes nfsver=3)
>    * tcp, intr, noauto, timeout, rsize, wsize, auto
>    * nfsvers=3,rsize=1024,wsize=1024,cto
> 
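
For anyone comparing the option strings in 8), here is a minimal Python
sketch that splits one into a dict and converts timeo to seconds (nfs(5)
documents timeo in tenths of a second). The function names are my own and
not part of any NFS tooling; treat it as an illustration only.

```python
def parse_nfs_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Bare flags such as "hard" map to True; "key=value" pairs keep the
    value, converted to int when it is purely numeric.
    """
    options = {}
    for item in option_string.split(","):
        # Tolerate stray spaces and the leading "-" used in automount maps.
        item = item.strip().lstrip("-")
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            options[key] = True
        elif value.isdigit():
            options[key] = int(value)
        else:
            options[key] = value
    return options


def timeo_seconds(options):
    """Return timeo in seconds, or None if the option is not set."""
    timeo = options.get("timeo")
    return None if timeo is None else timeo / 10


opts = parse_nfs_options("nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2")
print(opts["nfsvers"], opts["hard"], timeo_seconds(opts))  # prints: 3 True 60.0
```

Run over the list above, this makes it easy to see which responses set
timeo explicitly and which rely on the defaults.
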
> 9) Any explanations:
>    * We have not yet made the change to nfsv4, we use nolock due to various
>      application "issues", we do not hard set rsize/wsize as they have been
>      negotiating better values for a number of years on their own under v3,
>      and the timeout/retrans are a bit of a legacy set of values from working
>      on this issue of server overload. Hard was a choice on our end: having
>      things hang definitely seemed better than having things fail and go
>      stale. We still agree with the choice of hard. Intr just helps to
>      "interrupt" stuck things when needed.
>    * We like to be able to ctrl-C hung processes. For some systems we use
>      larger rsize/wsize if the vendor supports it.
>    * works for me without tweaks
>    * We didn't use tcp until the last couple of years.
>    * Probably needs a revisit- block size was set up for 2.x series kernels
>    * default of centos 7
>    * nfsv4 was not stable enough last time out, don't fix rsize/wsize as
>      client/server usually negotiate to 1M anyway
>    * We have frequent power outage (5+ times a year) and noauto helps our
>      not to hang on mounting nfs shares. Drawback is you have to manually
>      mount. Time out helps with this issue as well.
>    * These are adjusted if necessary for particular workloads
> 
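
Two of the answers above mention letting client and server negotiate
rsize/wsize. A small sketch, assuming the usual /proc/mounts layout
(device, mount point, fs type, options), for checking what was actually
negotiated. The function name and the /export/home path in the sample are
made up for illustration.

```python
def nfs_transfer_sizes(mounts_text):
    """Map each NFS mount point to the (rsize, wsize) shown in its options."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mount_point, option_field = fields[1], fields[3]
        # partition("=")[::2] yields (key, value); bare flags get value "".
        options = dict(
            item.partition("=")[::2] for item in option_field.split(",")
        )
        rsize = options.get("rsize")
        wsize = options.get("wsize")
        sizes[mount_point] = (
            int(rsize) if rsize else None,
            int(wsize) if wsize else None,
        )
    return sizes


sample = (
    "10.5.6.1:/export/home /home nfs4 "
    "rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,hard,proto=tcp 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)
print(nfs_transfer_sizes(sample))  # prints: {'/home': (1048576, 1048576)}
```

On a live client the same function can be fed the contents of
/proc/mounts, e.g. nfs_transfer_sizes(open("/proc/mounts").read()).
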
> 10) what parts of the file system do you use NFS for (free form):
>    * /home
>    * /home
>    * /home
>    * /home
>    * /home
>    * /home
>    * /home
>    * /home and /apps
>    * We use NFS for the OS (NFSRoot), App tree, $HOME, Group dedicated
>      space, as well as some of our scratch spaces. All of these come from
>      different NFS servers.
>    * /home, /apps
>    * /home /opt /etc /usr /boot
>    * /home,/apps,
>    * /home, /apps, /scratch - all of 'em
>    * /home, long term project storage, shared software
>    * /cluster/home,/cluster/local,/cluster/scratch,/cluster/data
>    * home, apps, shared data
>    * /usr/local, /home
>    * /home , /apps
>    * various
>    * /home, /group, /usr/local
>    * /home, parts of /opt, some specific top level auto-mountable dirs
>    * What above is called /apps and /home for a few medium sized systems
>    * /home, /local, /opt, /diskless
>    * /home, /opt, diskless node images
> 
> 11) How many nodes can mount a single NFS server at once:
>     24% >= 512 nodes
>     20% 65-128 nodes
>     16% 1-16 nodes
>     12% 17-32 nodes
>     12% 257-512 nodes
>     12% 129-256 nodes
>      4% 33-64 nodes
> 
> 12) How many NFSd daemons do you run per NFS server
>      45.0% 1-16
>      13.6% 129-256
>      13.6% 65-128
>       9.1% 33-64
>       4.5% 17-32
>       4.5% 256-512
>       4.5% 512-1024
>       4.5% 2048-4096
> 
> 13) Do you use NFSd or user space
>      81.0% Kernel NFSd
>      14.3% User space
>       4.8% Both
> 
> 14) What interconnect do you use with NFS?
>      38.5% 10G
>      26.9% GigE
>      23.1% IB
>      11.5% Other
> 
> 15) If IB what transport (10 responses)
>      100% IPoIB
>         0% Other
> 
> 16) If IB, do you use connected mode (8 responses)
>      62.5% Connected mode
>      37.5% Don't use connected mode
> 
> 17) Do you use UDP or TCP (25 responses)
>      84% TCP
>      12% UDP
>       4% Other
> 
> 18) Which other network file systems do you use? (24 responses)
>      0% PNFS
>      58.3% Lustre
>      16.7% Ceph
>      12.5% BeeGFS
>      12.5% GlusterFS
>       8.3% None (Panasas, GPFS, HSM/SAM/QFS, or more than one of the above)
> 
> 19) Are the other network file systems more or less reliable than NFS?
>      58.3% Similar
>      16.7% I use only NFS
>      12.5% Much more reliable
>       4.2% Much less reliable
>       4.2% Somewhat less reliable
>       4.2% Somewhat more reliable
> 
> 20) Do you support MPI-IO (not just MPI)
>      70.8% no
>      20.8% yes
>       8.3% (yes, but nobody uses it)
> 
> 21) Any tips for making NFS perform better or more reliably?
>    * We start with the underlying block (raid/disks) setup that you are going
>      to serve data out of and plumb up from there. The key thing here is
>      choosing your raid stride/chunk sizes and ensuring your file system is as
>      aware of the raid layout as you can make it, for good alignment. We do
>      follow the esnet host tuning found at:
>      http://fasterdata.es.net/host-tuning/linux/ on both client and server
>      systems. We also bump up the rpc.mountd count to help ensure successful
>      mounts as we use autofs to mount a number of the nfs spaces. When a
>      larger HPC job starts up on many nodes we did have a time where not all
>      would be able to mount successfully if the server was under load.
>      Increasing the rpc.mountd count helped. We also set async and wdelay on
>      our exports on the servers.
>    * Kernel settings
>    * I've heard that configuring IB in RDMA boosts NFS performance
>    * We don't use NFS for high performance cluster data. That's Lustre's
>      world. Where NFS is used for scientific data, it's in places where there
>      are modest numbers of concurrent clients.
>    * more disks
>    * RPCMOUNTDOPTS="--num-threads=64"
>    * Try to optimize /etc/sysconfig/nfs as much as possible.
> 
> 22) Any tips for making NFS clients perform better or more reliably?
>    * Following the above mentioned esnet info at:
>      http://fasterdata.es.net/host-tuning/linux/. I should note that for both
>      client and server that are using IPoIB we use connected mode and set the
>      MTU to 64k.
>    * Reducing the size of the kernel dirty buffer on the clients makes
>      performance much more consistent.
>    * use reliable interconnect hw
>    * We've tried scripting NFS mounts w/o much success.
>    * Educate users on using the right filesystem for the right task
> 
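
On the "kernel dirty buffer" tip in 22): the response does not say which
settings were changed. Assuming it refers to the Linux vm.dirty_* sysctls,
a rough Python sketch for reading the current values on a client follows.
The function name and the choice of knobs are my assumptions, not the
respondent's.

```python
from pathlib import Path

# Writeback knobs exposed under /proc/sys/vm on Linux. Assumption: these are
# the settings meant by "kernel dirty buffer" in the survey answer.
DIRTY_KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_bytes",
)


def read_dirty_settings(vm_dir="/proc/sys/vm"):
    """Return the current dirty-page settings as a dict.

    A knob that cannot be read (missing file, non-Linux host, non-numeric
    content) is reported as None.
    """
    settings = {}
    for name in DIRTY_KNOBS:
        path = Path(vm_dir) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


for name, value in read_dirty_settings().items():
    print(f"vm.{name} = {value}")
```

Comparing these values across clients before and after a change is one
way to confirm that a tuning change was actually applied everywhere.
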
> 23) Anything you would like to add:
> 
>    * We have also seen input from others that they see gains with the client
>      option of 'nocto'. The man pages would suggest this has some risks so
>      while we have tested and can see that certain loads see a gain from this
>      we have not yet moved forward to deploy this option on our general
>      setup. We are in process of testing our apps to ensure we do not create
>      other issues for apps if we do use this flag. Another thing we have been
>      looking at is cachefilesd and seeing how well that helps for data that
>      can easily be cached. For things like our application trees, the OS (we
>      are NFSRoot booted), and even some user reference data sets this looks
>      quite promising but we have not gone live with this yet either.
>    * We're always looking to improve our environment as well. We don't
>      always have TIME to do so, of course.
>    * Horses for courses. NFS is great for shared software and home
>      directories. It's pretty useless for high performance access from
>      hundreds of compute nodes.
>    * Every storage system / file system I've ever seen or used has had its
>      problems. There is no silver bullet (afaik). Use that which you have the
>      competence to handle.
>    * We are currently struggling with NFS mounts. We use them extensively
>      throughout our department. Problems are they hang constantly and when
>      one person is using the share heavily it slows down other computers.
>      We've done lots of research into optimizing NFS but always come back to
>      the same issues (hanging mounts that don't recover w/o admin
>      interaction). We would love to know what other people are doing. We are
>      experimenting with ceph at the moment for future large storage needs.
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf