[Beowulf] motherboards for diskless nodes
hahn at physics.mcmaster.ca
Fri Feb 25 14:18:16 PST 2005
> > Reasons to run disks for physics work.
> > 1. Large tmp files and checkpoints.
> Good reason, except when a node fails you lose your checkpoints.
you mean s/node/disk/, right? sure, but doing raid1 on a "diskless"
node is not insane. though frankly, if your disk failure rate is
that high, I'd probably do something like intermittently store
checkpoints on a server over the network instead.
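roughly this sort of thing (a sketch only; the paths and the hourly
interval are made up, adjust to taste):

    import shutil, time
    from pathlib import Path

    LOCAL  = Path("/tmp/ckpt")       # where jobs drop checkpoints (assumed)
    REMOTE = Path("/staging/ckpt")   # NFS-mounted staging area (assumed)

    def stage_checkpoints(interval_s=3600):
        """periodically copy any checkpoint newer than its staged copy."""
        while True:
            for src in LOCAL.glob("*.ckpt"):
                dst = REMOTE / src.name
                if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
                    shutil.copy2(src, dst)   # copy2 preserves mtimes
            time.sleep(interval_s)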
> It all ends up being a risk assessment. We have been up for close
> to 6 months now. We have not had a failure of the NFS server. The
I have two nothing-installed clusters; one in use for 2+ years,
the other for about 8 months. the older one has never had an
NFS-related problem of any kind (it's a dual-xeon with 2 u160
channels and 3 disks on each; other than scsi, nothing gold-plated.)
this cluster started out with 48 dual-xeons and a single 48-port
100bT switch with a gigabit uplink.
the newer cluster has been noticeably less stable, mainly because
I've been lazy. in this cluster, there are 3 racks of 32 dual-opterons
(fc2 x86_64) that netboot from a single head node. each rack has a
gigabit switch which is 4x LACP'ed to a "top" switch, which has
one measly gigabit to the head/fileserver. worse yet, the head/FS
is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.
as far as I can tell, you simply have to think a bit about the
bandwidths involved. the first cluster has many nodes connected
via thin pipes, aggregated through a switch onto a single gigabit
uplink that feeds decent on-server bandwidth.
the second cluster has lots more high-bandwidth nodes, connected
through 12 incoming gigabits, bottlenecked down to a single
connection to the head/file server (which is itself poorly configured).
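to put crude numbers on it (node counts and link speeds as above;
"oversubscription" here is just aggregate node demand divided by the
pipe into the server):

    def oversubscription(n_nodes, node_link_gbps, server_link_gbps):
        # ratio of what the nodes can demand to what the server link carries
        return n_nodes * node_link_gbps / server_link_gbps

    # cluster 1: 48 nodes at 100bT (0.1 Gb/s), one gigabit into the server
    print(oversubscription(48, 0.1, 1.0))    # 4.8:1 -- livable

    # cluster 2: 3 racks x 32 gigabit nodes; 12 Gb/s reaches the top
    # switch, but only one gigabit continues to the head/fileserver
    print(oversubscription(96, 1.0, 1.0))    # 96:1 -- no wonder it hurts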
one obvious fix to the latter is to move some IO load onto
a second fileserver, which I've done. great increase in stability,
though enough IO from enough nodes can still cause problems.
shortly I'll have logins, home directories and work/scratch all on
separate servers.
for a more scalable system, I would put a small fileserver in each rack,
but still leave the compute nodes nothing-installed. I know that
the folks at RQCHP/Sherbrooke have done something like this, very nicely,
for their serial farm. it does mean you have a potentially significant
number of other servers to manage, but they can be identically configured.
heck, they could even net-boot and just grab a copy of the compute-node
filesystems from a central source. the Sherbrooke solution involves
smart automation of the per-rack server for staging user files as well
(they're specifically trying to support parameterized montecarlo runs.)
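I don't know the details of the Sherbrooke automation, but the
per-rack refresh could be as dumb as this (hostnames and paths are
hypothetical):

    import subprocess

    CENTRAL = "central:/export/node-root/"   # master copy (hypothetical)
    LOCAL   = "/export/node-root/"           # this rack-server's copy

    def refresh_node_image():
        # pull an identical copy of the compute-node filesystem;
        # --delete keeps every rack server byte-identical to the master
        subprocess.run(["rsync", "-a", "--delete", CENTRAL, LOCAL],
                       check=True)

run that from cron on each rack server and they stay identically
configured with no per-machine state to manage.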
regards, mark hahn.