Optimal number of nfsd? [was: Re: OT:nfs hangs with redhat 7.3]

Mon Jul 29 10:53:02 PDT 2002

On Mon, Jul 29, 2002 at 09:51:12AM -0700, Jim Meyer's all...
> Hello!
> 
> On Mon, 2002-07-29 at 05:33, Tim Wait wrote:
> > 
> > > there has been a very, very great deal of improvement in the kernel's
> > > MM and NFS since 7.3 was released.  you should consider getting/running
> > > 2.4.19-rc3.
> > 
> > In a possibly related issue, do you have enough nfsd daemons running?
> > Check /etc/init.d/nfs... this fixed it for me on an older version:
> > 
> > # Number of servers to be started by default
> > RPCNFSDCOUNT=36
> 
> Here's an excellent question: how does one calculate the optimal number
> of nfsd processes to have lying about?

If you are concerned about latency, you should see that the last,
least-called-upon NFSD process has far less CPU time over its lifetime
than the others, i.e. it is rarely needed. Since we have diskless cluster
nodes here, we can see many hours of CPU time on the main NFSD and the
first few children, and but a few seconds on the last one. (But then again
we use FreeBSD; I'd assume Linux does something similar - no reason to
'round robin' the workload to every NFSD when there's only one current
request.)
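
A rough way to check that heuristic without eyeballing ps is sketched below
in Python. It assumes you have saved lines of the form "command cputime"
(roughly what `ps -axo comm,time` prints; the exact columns, the helper
names and the sample numbers are my assumptions, made up for illustration)
and reports how idle the least-used NFSD is relative to the busiest one.

```python
def cpu_seconds(field):
    """Convert a ps-style CPU time ('1-02:03:04', '12:34:56', '0:07.50') to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return days * 86400 + seconds


def nfsd_cpu_spread(ps_text, name="nfsd"):
    """Return (least, most, least/most) CPU seconds over lines whose command is `name`."""
    times = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            times.append(cpu_seconds(fields[-1]))
    if not times:
        raise ValueError("no %s lines found" % name)
    least, most = min(times), max(times)
    # If nothing has used any CPU yet, treat all daemons as equally idle.
    ratio = least / most if most > 0 else 1.0
    return least, most, ratio


# Made-up sample, not real measurements.
SAMPLE = """\
nfsd 11:02:13
nfsd 04:40:10
nfsd 00:12:30
nfsd 00:00:04
sshd 00:00:01
"""

if __name__ == "__main__":
    least, most, ratio = nfsd_cpu_spread(SAMPLE)
    print("least busy: %.0fs, busiest: %.0fs, ratio: %.4f" % (least, most, ratio))
```

A very small ratio suggests the last NFSD is hardly ever called upon; a
ratio near 1 suggests every daemon is kept busy and more might help.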

However, we've also had the problem of 80 nodes hitting a script error and
spawning a ton of children, basically thrashing the nodes nicely and, I think,
using up their swap (over the network... :) - imagine what 80 nodes do to 8
100Mbps networks on a single fileserver. The load shoots up to (# of NFSD
processes) -- we had over 60 NFSDs running at first... :) Getting back on the
box can be a challenge.  Setting nice levels only helps so much (but it does
help... I leave an emergency shell around at -20 and I actually put the last
40 or 50 NFSDs at nice level 20, but there's hardly anything else running on
the server that would use CPU, so latency isn't a major issue - disk is so
much slower anyway).
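
For the nice-level trick, here is a minimal Python sketch of deciding which
NFSDs to deprioritise. The function names and PIDs are hypothetical, and it
assumes the lowest PIDs are the daemons that get work first, which is only a
guess about how your system hands out requests.

```python
import os


def nice_plan(nfsd_pids, keep_normal, low_nice=20):
    """Map PID -> nice value.

    The first `keep_normal` PIDs (in sorted order) stay at 0, the rest get
    `low_nice`. Assumption: the lowest PIDs are the NFSDs that get work first.
    """
    ordered = sorted(nfsd_pids)
    return {pid: (0 if i < keep_normal else low_nice)
            for i, pid in enumerate(ordered)}


def apply_plan(plan):
    """Apply the plan with os.setpriority (needs sufficient privileges)."""
    for pid, value in plan.items():
        # Linux clamps nice values to the range -20..19.
        os.setpriority(os.PRIO_PROCESS, pid, value)


if __name__ == "__main__":
    # Hypothetical PIDs, for illustration only; apply_plan() is not called here.
    print(nice_plan([412, 410, 411, 413], keep_normal=2))
```

Keeping the plan separate from applying it lets you look at what would
change before touching a loaded server.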

Really though, if your calculations' timings depend on NFS not being slow,
then you might want to reconsider your design - getting more RAM for caching
on nodes and server, or even being super radical and using NFS proxies (big
fat RAM caches, or disk if your cluster is massive enough). We use NFS only
for loading and logging with Gromacs, as well as for scratch files with G98.
When things are running correctly, we see only about 20-30 Mbps (split
more or less evenly across 8 separate 100Mbps nets), which is quite low for
our 6-drive 15k RPM SCSI LVD160 RAID 5 NFS server with 2x1.33GHz Athlons in
it.
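
To put those numbers in proportion, here is the arithmetic as a small Python
sketch. It uses only the figures above (8 nets, 100Mbps each, 80 nodes) and
nominal link speed, ignoring protocol overhead; the function names are just
for illustration.

```python
LINKS = 8          # separate networks, from the figures above
LINK_MBPS = 100    # nominal speed of each network
NODES = 80         # cluster nodes


def aggregate_mbps(links=LINKS, link_mbps=LINK_MBPS):
    """Nominal total bandwidth into the fileserver."""
    return links * link_mbps


def utilisation(observed_mbps):
    """Fraction of nominal aggregate bandwidth actually in use."""
    return observed_mbps / aggregate_mbps()


def per_node_share_mbps(nodes=NODES):
    """Nominal fair share per node if every node pushes data at once."""
    return aggregate_mbps() / nodes


if __name__ == "__main__":
    for observed in (20, 30):
        print("%d Mbps is %.2f%% of nominal capacity"
              % (observed, 100 * utilisation(observed)))
    print("fair share per node when saturated: %.0f Mbps"
          % per_node_share_mbps())
```

So normal traffic is a few percent of what the wires could carry, while 80
thrashing nodes can each claim a sizeable slice and fill all 8 nets.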

Which leads one to think that fewer NFSDs might not make much difference
anyway. What we saw over time, with machines running similar jobs and a lot
of NFSDs, was large spikes of disk usage: because the NFSDs were around to
handle requests immediately, there was little lag, and because the nodes
were all running similar jobs, we saw similar spikes later on when the time
came for the jobs to go to disk again - only the peaks would get slightly
flatter over time. In between were long periods of little access.

With fewer NFSDs, we saw that a few nodes would get lagged by a second or
so over the entire first write for the first spike, and the second
spike was much flatter. By the time 10 writing 'spikes' had gone by, things
had redistributed themselves so that it was almost a constant load on the NFS
server. (The lifetime of a job was about equivalent to 50-100 'loops',
each including one big write.) End result was we saw more even load on
the server, and the average job time was about 20-30s longer
(some finished in almost the same time, others were 3-4 minutes slower) -
but over a 1-2 day long job, this is hardly an issue. This strange
effect wasn't worth chasing down; it just gave me confidence that running
only 10 or 20 NFSDs was enough - if there's a big spike in requests
for disk, some clients will just have to wait a second, and I don't
have to run 250 NFSDs to ensure there's no extra 2ms of lag for
the job.
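
The spreading-out effect can be reproduced with a toy queueing model,
sketched here in Python. Everything in it is an assumption for illustration:
fixed compute and write times, a FIFO queue, and each NFSD serving one
request at a time independently of the others (which ignores that they all
share one disk). It is not a model of our actual cluster.

```python
import heapq


def simulate(n_clients, n_nfsd, compute_s, service_s, loops):
    """Return (finish_times, wait_per_loop) for a FIFO server with n_nfsd workers.

    Each client repeats `loops` times: compute for compute_s, then issue one
    write that occupies a worker for service_s.
    """
    # Every client computes first, so all first requests arrive together.
    events = [(compute_s, client, 0) for client in range(n_clients)]
    heapq.heapify(events)
    workers = [0] * n_nfsd            # time at which each worker is next free
    finish = [0] * n_clients          # completion time of each client's last write
    wait_per_loop = [0] * loops       # total queueing delay in each loop

    while events:
        arrival, client, loop = heapq.heappop(events)
        free_at = heapq.heappop(workers)      # earliest available worker
        start = max(arrival, free_at)
        done = start + service_s
        heapq.heappush(workers, done)
        wait_per_loop[loop] += start - arrival
        finish[client] = done
        if loop + 1 < loops:
            heapq.heappush(events, (done + compute_s, client, loop + 1))
    return finish, wait_per_loop


if __name__ == "__main__":
    for n_nfsd in (8, 2):
        finish, waits = simulate(n_clients=8, n_nfsd=n_nfsd,
                                 compute_s=10, service_s=1, loops=5)
        print(n_nfsd, "NFSDs: slowest job", max(finish),
              "s, waits per loop", waits)
```

With these made-up numbers the only queueing happens in the first loop:
the clients that had to wait come back staggered, so later writes no longer
collide, and the slowest job is only a few seconds behind.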

But this of course is completely dependent on the access patterns of your
jobs. Caching on the server is almost useless to me: in my case there's lots
of writing of a few very large files (10s to 100s of MB), then re-reading of
what was written (G98 and its, I suspect, less-than-optimal-for-NFS
scratch file system). For other systems, more NFSDs may give a large
increase in performance if you have to r/w many small files that will fit
in cache RAM instead of having to go to disk.

Interesting and weird things happen with parallel usage of resources -
more power, CPU or daemons doesn't always equal better performance.

Even one NFSD would work in our case (though I didn't get down to that
pathological a case) since 100% CPU is not required to service that level of
disk access.  Multiple NFSDs increase concurrency; they don't magically make
CPU cycles appear out of thin air or decrease disk-access time (though there
may be some effect from the disk re-ordering access to separate blocks when
more requests come in concurrently - that may be a bit more efficient).

Originally, 4 NFSDs is what I used for a while because we had a problem
tracking down the bad script causing nodes to thrash. (In fact, IIRC, it wasn't
a script thrashing swap, it was an error in cloning the nodes' /var dirs - a
shared wtmp and utmp is NOT a good thing. I think as they booted, the nodes
were writing to one of the two and other nodes would see this, and it turned
into a huge loop - basically shutting off the switches was the only way to get
time on the NFS server and do a killall nfsd :) Now we just leave 12 or
16-ish NFSDs - I see the most idle one has less than 1% of the CPU time of
the most busy, so it's all good.

/kc

> 
> Thanks!
> 
> --j
> -- 
> Jim Meyer, Geek At Large                              purp at wildbrain.com
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA