[Beowulf] NFS shared file system

Sun Dec 4 09:55:48 PST 2005

> Each cluster had its own head node and its own cheap, in-house build
> RAID exported over GB NFS. Recently we combined the existing clusters

when it was in the 3x30 state, did you measure the RAID's internal (local)
performance, and its performance under "normal" load from the nodes?
also, have you characterized the IO load of your CFD application?
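
for the "internal performance" half of that question, a rough sketch of a
sequential-write probe is below. the function name, file name and sizes are
made up for illustration, and it only measures streaming writes of zeros
(some filesystems compress or short-cut those), so treat the number as a
first approximation rather than a benchmark. run it once on the RAID's local
filesystem and once on an NFS mount of the same volume to compare.

```python
import os
import time


def write_throughput_mb_s(directory: str, total_mb: int = 256, block_kb: int = 1024) -> float:
    """Write total_mb of zeros into `directory`, fsync, and return MB/s.

    Hypothetical helper: the name, defaults and probe file name are
    illustrative assumptions, not part of any existing tool.
    """
    block = b"\0" * (block_kb * 1024)
    n_blocks = (total_mb * 1024) // block_kb
    path = os.path.join(directory, "throughput_probe.tmp")
    start = time.perf_counter()
    try:
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to push data to the server/disk
        elapsed = time.perf_counter() - start
    finally:
        # always clean up the probe file, even if the write failed
        if os.path.exists(path):
            os.remove(path)
    written_mb = n_blocks * block_kb / 1024
    return written_mb / elapsed
```

the gap between the local number and the NFS number tells you whether the
disks or the network/NFS layer is the first thing to saturate.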

> into one and the first problem we have is with the mass storage,
> occasionally it cannot handle the IO load. My question is if I buy a
> commercial NAS what are the chances that after that I'll need to replace
> GB with Myrinet (e.g.)?

well, the better question is why you got rid of two of the IO nodes - 
or did you?

> clusters but from what I read in this newsgroup my understanding is
> that 90 nodes is a small cluster and I didn't expect scalability
> problems at this level.

the traffic here is somewhat specialized, of course - people doing 16-node
clusters are not having any problems, and so don't speak up ;)

90 nodes is clearly enough to show real scaling problems if the load is
reasonably intensive and comes from multiple nodes simultaneously.  is it
safe to assume you've done the basic first steps in tuning (lots of nfsd
threads, perhaps also longer attribute-cache (AC) timeouts on the client
side, probably not using the default 32K packets)?
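
to put a number on why 90 nodes is different from 30, here is a
back-of-envelope sketch. it assumes a single gigabit link out of the server
at its raw line rate of 125 MB/s and every node doing IO at once; real NFS
throughput will be lower, and the function name is just for illustration.

```python
GIGABIT_MB_PER_SEC = 1_000_000_000 / 8 / 1e6  # 125 MB/s raw line rate, before protocol overhead


def per_node_share_mb_s(nodes: int, links: int = 1) -> float:
    """Fair share of the server's outbound bandwidth if all nodes are active."""
    return links * GIGABIT_MB_PER_SEC / nodes


for nodes in (16, 30, 90):
    print(f"{nodes:3d} nodes: {per_node_share_mb_s(nodes):5.2f} MB/s each")
```

so going from 30 clients per server to 90 clients on one server cuts the
worst-case share per node to a third, which is enough to turn an occasional
annoyance into a visible stall.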

> If a commercial storage optimized for IO is a
> solution what is the price I'm facing? Any recommendations?

depends on what your IO goals are.  do you insist on a single filesystem
implemented across multiple server nodes?  if so, you have to look into 
cluster-fs things like Lustre, GPFS, HP's SFS, Panasas, etc.  the overhead
(dollars and brains) is nontrivial.

I would probably split the workload across three independent NFS servers,
and also try some basic tuning.  these are cheap, easy to do and will
definitely improve performance.
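
a minimal sketch of the split, assuming 90 nodes numbered 0..89 and three
servers. the hostnames (nfs1..nfs3), export path and mount point are
hypothetical, and the mount options are left at "defaults" on purpose since
the right values depend on your tuning.

```python
NFS_SERVERS = ["nfs1", "nfs2", "nfs3"]  # hypothetical hostnames


def server_for_node(node_index: int, nodes_per_server: int = 30) -> str:
    """Map a node to a server in contiguous blocks, like the old 3x30 layout."""
    return NFS_SERVERS[(node_index // nodes_per_server) % len(NFS_SERVERS)]


def fstab_line(node_index: int, export: str = "/export", mountpoint: str = "/data") -> str:
    """Build the /etc/fstab entry a given node would use (paths are assumptions)."""
    return f"{server_for_node(node_index)}:{export} {mountpoint} nfs defaults 0 0"


if __name__ == "__main__":
    for node in (0, 29, 30, 89):
        print(node, fstab_line(node))
```

a static block mapping like this keeps each server at 30 clients; if jobs
span all 90 nodes you would instead want to split by project or dataset so
one job doesn't hammer a single server.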

more speculative things:

	- use LACP (link aggregation) or related techniques to provide more
	bandwidth out of the NFS server(s).  this will probably not improve
	the bandwidth seen by a single node, but with two links should come
	close to doubling the aggregate.

	- try out fscache - this is an add-on layer being promulgated by 
	Red Hat which creates a local disk cache to offload your NFS server.
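
on the LACP point, a toy model of why the aggregate improves but a single
node doesn't: aggregated links spread traffic per flow, so each client ends
up pinned to one physical link. the modulo "hash" below is a stand-in
assumption (real switches and bonding drivers hash on MAC/IP/port fields),
and 125 MB/s is the raw gigabit line rate rather than achievable NFS
throughput.

```python
from collections import Counter

LINK_MB_PER_SEC = 125.0  # raw gigabit line rate; real NFS throughput is lower


def assign_link(client_id: int, n_links: int) -> int:
    """Stand-in for the aggregation hash policy (assumption: simple modulo)."""
    return client_id % n_links


def summarize(n_clients: int, n_links: int) -> dict:
    """Estimate aggregate and per-client bandwidth when all clients are active."""
    load = Counter(assign_link(c, n_links) for c in range(n_clients))
    busiest = max(load.values())
    return {
        # only links that actually carry a flow contribute to the aggregate
        "aggregate_mb_s": LINK_MB_PER_SEC * len(load),
        # one client never exceeds one physical link
        "single_client_peak_mb_s": LINK_MB_PER_SEC,
        # fair share on the most crowded link
        "per_client_share_mb_s": LINK_MB_PER_SEC / busiest,
    }


if __name__ == "__main__":
    print(summarize(90, 1))
    print(summarize(90, 2))
```

with 90 clients the two links fill evenly and the aggregate doubles; with a
single client only one link is ever used, which is the "won't help a single
node" caveat.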