[Beowulf] Need recommendation for a new 512 core Linux cluster

Mark Hahn hahn at mcmaster.ca
Wed Nov 7 17:18:08 PST 2007


> Hi, all.  I would like to know for that many cores, what kind of file
> system should we go with?

what kind of workload?
512c, these days, could be 64 nodes or fewer.  will you have a fast
(ie, > Gb) interconnect available?

> Currently we have a couple of clusters with
> around 100 cores and NFS seems to be ok but not great.

since nontrivial $ is involved, I'd recommend quantifying that a bit more.
for instance, suspend all your jobs and run a pure bandwidth test on your
NFS server(s).  then resume all the jobs, and collect basic data on aggregate
IO (log vmstat, /proc/partitions, sar, even tcpdump) over a day or two.
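as a rough sketch of the kind of logger I mean (Python, sampling
/proc/diskstats, which is where 2.6 kernels keep the per-disk counters
that used to live in /proc/partitions; the interval and device prefix
are assumptions -- adjust them for your server):

    #!/usr/bin/env python
    # log aggregate disk IO on an NFS server by sampling /proc/diskstats.
    import time

    INTERVAL = 10          # seconds between samples (arbitrary)
    DEV_PREFIX = "sd"      # whole-disk devices to sum (assumption)

    def sample():
        """return total sectors read/written across matching whole disks."""
        rd = wr = 0
        for line in open("/proc/diskstats"):
            f = line.split()
            name = f[2]
            # skip partitions (sda1, sda2, ...), keep whole disks (sda)
            if name.startswith(DEV_PREFIX) and not name[-1].isdigit():
                rd += int(f[5])     # sectors read
                wr += int(f[9])     # sectors written
        return rd, wr

    prev = sample()
    while True:
        time.sleep(INTERVAL)
        cur = sample()
        # sectors are 512 bytes; report MB/s since the last sample
        rd_mb = (cur[0] - prev[0]) * 512 / 1e6 / INTERVAL
        wr_mb = (cur[1] - prev[1]) * 512 / 1e6 / INTERVAL
        print("%s  read %.1f MB/s  write %.1f MB/s" % (time.ctime(), rd_mb, wr_mb))
        prev = cur

a day or two of that output tells you whether your peaks are bandwidth
or just lots of small IO.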

I'd say that achieving 200-500 MB/s aggregate to a single NFS server 
(assuming IB or 10G) is pretty easy today.  is that enough?

if it's not enough, can you simply run 2-4 such servers?
yes, having a single namespace is very convenient, but these days, 
disks are embarrassingly large and cheap.  even a very small, manageable
and cheap single server is ~10 TB.  if your goal is a total of 40 TB
but any one user will only use ~2 TB, then there's not much downside 
(and some upsides) to implementing 40 TB as 4x10 TB.
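for example, the automounter lets you keep the per-user view looking like
one namespace even though the storage is split; server names and paths
below are purely hypothetical:

    # /etc/auto.master (assumed layout)
    /home   /etc/auto.home

    # /etc/auto.home -- per-user entries spread over four ~10 TB servers
    alice   nfs1:/export/home/alice
    bob     nfs2:/export/home/bob
    carol   nfs3:/export/home/carol
    dave    nfs4:/export/home/dave

users still refer to /home/<user>; only the maps (and you) know there are
four servers behind it.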

> We definitely
> need to put in place a parallel file system for this new cluster and I
> do not know which one I should go with?  Lustre, GFS, PVFS2 or what
> else?  Could you share your experiences regarding this aspect?

my experience with Lustre is via HP SFS, which means it's constrained 
by HP's particular hw and sw choices.  we have 4x 70TB and 1x 200TB SFS
systems.  they are now reasonably stable (but didn't start out that way), 
and we don't appear to have bandwidth or throughput problems.  (we have
>2000 users, very heterogeneous, spanning serial to big mpi.)  our main
SFS clusters have 12-24 OSSes (content servers), with SATA-based storage
that's dual-pathed.

metadata performance is annoying to users, since any nontrivial directory
takes noticeably long to run ls in (users usually have alias ls='ls -FC', so 
listing means statting every entry.)  and if you have admin-type needs,
such as periodically profiling all files by user/size/hash, it's a real
problem to traverse 10-20M used inodes at ~700 stats/second.
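(at ~700 stats/second, a 20M-inode sweep is roughly eight hours.)  if you
want to measure your own stat rate, a minimal sketch (Python; the default
starting path is just a placeholder):

    #!/usr/bin/env python
    # rough stat-rate probe: walk a tree, lstat everything, report stats/second.
    import os, sys, time

    root = sys.argv[1] if len(sys.argv) > 1 else "/scratch"   # hypothetical path
    count = 0
    start = time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass                      # vanished or unreadable entry
    elapsed = time.time() - start
    print("%d stats in %.1f s = %.0f stats/second" % (count, elapsed, count / elapsed))

run it against a representative user directory while the cluster is busy,
since an idle metadata server will flatter you.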

> I also would like to know how many head nodes should I need to manage
> jobs and queues.

one.  do you have tons of very short jobs, or a very inefficient scheduler?
since a head node can be pretty cheap (2-4 GB, 2-4 cores), I'd probably
throw a couple extra in.  you can burn an almost arbitrary amount of cycles
doing monitoring/logging-type things, and it's very handy to have a node
to fail over scheduler/logging/monitoring-type services onto.  I would
definitely keep such admin nodes distinct from either fileserver or
user-login nodes.

regards, mark hahn.
