[Beowulf] NFS & Scaling issues

Tue Apr 3 13:39:32 PDT 2007

Hi,

We are running a cluster of 180 diskless compute nodes. 60 of them have 
32-bit AMD Sempron processors and the rest have dual-core AMD Athlon 
64-bit processors. The 32-bit machines have 10/100 Mbps Ethernet cards 
and the rest have gigabit Ethernet cards. We have four file servers, 
each hosting around 3.5 TB on SATA drives connected to 3ware RAID 
controller cards configured as RAID 10 arrays. These file servers 
export the drives through NFS. Each file server is running 265 nfsd 
daemons.

The file servers mainly host a large number of small files ranging 
from 256 KB to 2 MB. The compute nodes are primarily searching 
through these files, so there is lots of reading and some writing to 
the file servers.

Recently we started noticing very high (70-90%) wait states on the file 
servers when the compute nodes are accessing them. We have tried to tune 
NFS by increasing the number of nfsd daemons and the rsize and wsize 
values, but to no avail.
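
For reference, here is a minimal sketch of how a wait figure like the 
one above could be sampled on a file server. It assumes the wait states 
correspond to the iowait counter on the aggregate "cpu" line of the 
standard Linux /proc/stat file (fields: user, nice, system, idle, 
iowait, irq, softirq, steal). The function names are illustrative only 
and are not part of our setup.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines
        # are "cpu0", "cpu1", and so on.
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum only the first eight fields (user .. steal); any later guest
    # fields are already accounted for in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read the counters twice, 'interval' seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

Calling sample_iowait() on a file server while the compute nodes are 
busy would give a number comparable to the 70-90% quoted above, provided 
the assumption about iowait holds.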

Can someone point us in the right direction as to how we should 
troubleshoot this problem?

PS: All the nodes are running SuSE 10.0, the servers are running SuSE 
10.0 and 10.1, and all the drives are formatted with reiserfs.

thanks

-- 
Amrik