[Beowulf] commercial clusters

Fri Sep 29 20:10:29 PDT 2006

hmmm...  200 nodes writing to the same file.  That is a hard problem.
In all my testing of global file systems I haven't found one that is
capable of doing this while delivering good performance.  One might
think that MPI-IO would deliver performance while writing to the same
file (on something like Lustre) but in my experience, MPI-IO is more
about functionality than performance.

In any code that I write that needs a lot of bandwidth, I always write
an n-m I/O routine.  That is, your n processor task can read the
previous m checkpoint-chunks (produced by an earlier m processor
job).  Then, when writing out the checkpoint or output file, you get
each process to open its own individual file and dump its data to it.
This gives you maximum bandwidth and stops meta-data thrashing on your
cluster FS.  It is also quite easy to write single-cpu tools which
concatenate the files together...
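
To make the n-m idea concrete, here is a rough sketch in Python.  It is
not lifted from any real code of mine: the ckpt.<rank> naming, the
function names and the "split the bytes evenly over the n readers"
rule are all assumptions for illustration.  A real code would split on
record boundaries and call these from each MPI rank.

```python
import os

def chunk_path(directory, rank):
    # Hypothetical naming scheme: one chunk file per writing process.
    return os.path.join(directory, f"ckpt.{rank}")

def write_chunk(directory, rank, data):
    # Each process writes only its own file, so writers never share a file.
    with open(chunk_path(directory, rank), "wb") as f:
        f.write(data)

def read_slice(directory, m, rank, n):
    """Return the share of reader `rank` (of n) from m existing chunk files.

    The m chunks are treated as one logical byte stream that is split
    evenly over the n readers (an assumed decomposition).
    """
    sizes = [os.path.getsize(chunk_path(directory, i)) for i in range(m)]
    total = sum(sizes)
    start = total * rank // n
    end = total * (rank + 1) // n
    out = bytearray()
    offset = 0  # logical offset of the first byte of the current chunk
    for i, size in enumerate(sizes):
        lo = max(start, offset)
        hi = min(end, offset + size)
        if lo < hi:  # this chunk overlaps the reader's byte range
            with open(chunk_path(directory, i), "rb") as f:
                f.seek(lo - offset)
                out += f.read(hi - lo)
        offset += size
    return bytes(out)

def concatenate(directory, m, out_path):
    # Single-cpu tool: join the m chunk files, in rank order, into one file.
    with open(out_path, "wb") as out:
        for i in range(m):
            with open(chunk_path(directory, i), "rb") as f:
                out.write(f.read())
```

Each rank of the old m processor job calls write_chunk() once; each rank
of the new n processor job calls read_slice(directory, m, rank, n), and
concatenate() is the post-processing tool.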

Alternatively, you can write a simple client-side FUSE file system
which sort of joins multiple NFS mounts together into a single FS.  In
this way, you can stripe your IO over multiple NFS mounts...  very
similar to the cluster file system that was present in the
Digital/Compaq SC machines.  In this fashion, your file in the FUSE FS
looks consistent and coherent while in the underlying NFS directories
you see your file split up into bits (file.1 file.2 file.3 file.4 etc
for a 4 NFS mount system).  This is a simple way to get your bandwidth
up (especially if your NFS mounts are coming in over different gig-E
NICs) but it still gives REALLY crap bandwidth when you have multiple
threads writing to the same file...
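
The striping underneath such a file system can be sketched as below.
This is only the offset arithmetic, not a FUSE implementation: the
round-robin layout, the 1 MB stripe size and all of the function names
are assumptions, and `mounts` is just a list of directories standing in
for the NFS mount points.

```python
import os

STRIPE = 1 << 20  # assumed stripe size in bytes

def locate(offset, n_mounts, stripe=STRIPE):
    """Map a logical offset to (mount index, offset inside that piece)."""
    block = offset // stripe
    mount = block % n_mounts  # round-robin over the mounts
    local = (block // n_mounts) * stripe + offset % stripe
    return mount, local

def piece_path(mounts, name, i):
    # Pieces are named name.1, name.2, ... one per mount.
    return os.path.join(mounts[i], f"{name}.{i + 1}")

def striped_write(mounts, name, offset, data, stripe=STRIPE):
    pos = 0
    while pos < len(data):
        mount, local = locate(offset + pos, len(mounts), stripe)
        # Never cross a stripe boundary in a single write.
        n = min(stripe - (offset + pos) % stripe, len(data) - pos)
        path = piece_path(mounts, name, mount)
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(local)
            f.write(data[pos:pos + n])
        pos += n

def striped_read(mounts, name, offset, length, stripe=STRIPE):
    out = bytearray()
    pos = 0
    while pos < length:
        mount, local = locate(offset + pos, len(mounts), stripe)
        n = min(stripe - (offset + pos) % stripe, length - pos)
        with open(piece_path(mounts, name, mount), "rb") as f:
            f.seek(local)
            chunk = f.read(n)
        out += chunk
        if len(chunk) < n:  # hit the end of the data
            break
        pos += n
    return bytes(out)
```

A FUSE layer would call something like striped_read() and striped_write()
from its read and write handlers.  Note there is no locking here at all,
which is one reason several writers on the same file would still hurt.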

Try Lustre :)

> Our big cluster is 2000+ nodes and we only have some 270TB (I say only because we're getting another 100TB+ SAN by year end) and we are able to move a lot of production through that cluster and that's what it is all about.   We do have numbers cluster wide.  As we move to the new model, we have to deal with 200 nodes trying to write to one large file, so we need to explore ways of accomplishing this without affecting our production environment.  Any ideas?

-- 
Dr Stuart Midgley
sdm900 at gmail.com