[Beowulf] motherboards for diskless nodes
hahn at physics.mcmaster.ca
Fri Feb 25 14:18:16 PST 2005
> > Reasons to run disks for physics work.
> > 1. Large tmp files and checkpoints.
> Good reason, except when a node fails you lose your checkpoints.
you mean s/node/disk/, right? sure, but doing raid1 on a "diskless"
node is not insane. though frankly, if your disk failure rate is
that high, I'd probably do something like intermittently store
checkpoints on a server over the network instead.
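roughly this sort of thing (a sketch only; the paths and the hourly
interval are made up, adjust to taste):

    import shutil, time
    from pathlib import Path

    LOCAL  = Path("/tmp/ckpt")       # where jobs drop checkpoints (assumed)
    REMOTE = Path("/staging/ckpt")   # NFS-mounted staging area (assumed)

    def stage_checkpoints(interval_s=3600):
        """periodically copy any checkpoint newer than its staged copy."""
        while True:
            for src in LOCAL.glob("*.ckpt"):
                dst = REMOTE / src.name
                if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
                    shutil.copy2(src, dst)   # copy2 preserves mtimes
            time.sleep(interval_s)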
> It all ends up being a risk assessment. We have been up for close
> to 6 months now. We have not had a failure of the NFS server. The
I have two nothing-installed clusters; one in use for 2+ years,
the other for about 8 months. the older one has never had an
NFS-related problem of any kind (it's a dual-xeon with 2 u160
channels and 3 disks on each; other than scsi, nothing gold-plated.)
this cluster started out with 48 dual-xeons and a single 48-port
100bT switch with a gigabit uplink.
the newer cluster has been noticeably less stable, mainly because
I've been lazy. in this cluster, there are 3 racks of 32 dual-opterons
(fc2 x86_64) that netboot from a single head node. each rack has a
gigabit switch which is 4x LACP'ed to a "top" switch, which has
one measly gigabit to the head/fileserver. worse yet, the head/FS
is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.
as far as I can tell, you simply have to think a bit about the
bandwidths involved. the first cluster has many nodes connected
via thin pipes, aggregated through a switch onto a single gigabit
uplink that feeds decent on-server bandwidth.
the second cluster has lots more high-bandwidth nodes, connected
through 12 incoming gigabits, bottlenecked down to a single
connection to the head/file server (which is itself poorly configured).
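to put crude numbers on it (node counts and link speeds as above;
"oversubscription" here is just aggregate node demand divided by the
pipe into the server):

    def oversubscription(n_nodes, node_link_gbps, server_link_gbps):
        # ratio of what the nodes can demand to what the server link carries
        return n_nodes * node_link_gbps / server_link_gbps

    # cluster 1: 48 nodes at 100bT (0.1 Gb/s), one gigabit into the server
    print(oversubscription(48, 0.1, 1.0))    # 4.8:1 -- livable

    # cluster 2: 3 racks x 32 gigabit nodes; 12 Gb/s reaches the top
    # switch, but only one gigabit continues to the head/fileserver
    print(oversubscription(96, 1.0, 1.0))    # 96:1 -- no wonder it hurts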
one obvious fix to the latter is to move some IO load onto
a second fileserver, which I've done. great increase in stability,
though enough IO from enough nodes can still cause problems.
shortly I'll have logins, home directories and work/scratch all on
separate servers.
for a more scalable system, I would put a small fileserver in each rack,
but still leave the compute nodes nothing-installed. I know that
the folks at RQCHP/Sherbrooke have done something like this, very nicely,
for their serial farm. it does mean you have a potentially significant
number of other servers to manage, but they can be identically configured.
heck, they could even net-boot and just grab a copy of the compute-node
filesystems from a central source. the Sherbrooke solution involves
smart automation of the per-rack server for staging user files as well
(they're specifically trying to support parameterized montecarlo runs.)
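I don't know the details of the Sherbrooke automation, but the
per-rack refresh could be as dumb as this (hostnames and paths are
hypothetical):

    import subprocess

    CENTRAL = "central:/export/node-root/"   # master copy (hypothetical)
    LOCAL   = "/export/node-root/"           # this rack-server's copy

    def refresh_node_image():
        # pull an identical copy of the compute-node filesystem;
        # --delete keeps every rack server byte-identical to the master
        subprocess.run(["rsync", "-a", "--delete", CENTRAL, LOCAL],
                       check=True)

run that from cron on each rack server and they stay identically
configured with no per-machine state to manage.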
regards, mark hahn.