[Beowulf] motherboards for diskless nodes
Craig Tierney
ctierney at HPTI.com
Fri Feb 25 15:02:06 PST 2005
On Fri, 2005-02-25 at 15:18, Mark Hahn wrote:
> > > Reasons to run disks for physics work.
> > > 1. Large tmp files and checkpoints.
> >
> > Good reason, except when a node fails you lose your checkpoints.
>
> you mean s/node/disk/ right? sure, but doing raid1 on a "diskless"
> node is not insane. though frankly, if your disk failure rate is
> that high, I'd probably do something like intermittently store
> checkpoints off-node.
Yes and no. If the node itself is down, it is a bit tough for your model
to progress regardless. RAID1 works well enough in software that the
only additional hardware you need is the extra disk.
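Just to make the off-node idea concrete, here is a minimal sketch
(Python; the paths, file pattern and interval are all made up) of the
sort of thing Mark suggests: write checkpoints to local scratch for
speed, and periodically copy the newest one to an NFS-mounted
directory so a dead node or disk only costs you the last interval.

    # Hedged sketch only: stage the newest local checkpoint off-node
    # at a fixed interval.  Paths and the interval are hypothetical.
    import glob
    import os
    import shutil
    import time

    LOCAL_SCRATCH = "/scratch/ckpt"       # local disk (single or RAID1)
    OFF_NODE_DIR  = "/home/shared/ckpt"   # NFS mount, survives the node
    STAGE_EVERY   = 3600                  # seconds between copies

    def newest_checkpoint():
        files = glob.glob(os.path.join(LOCAL_SCRATCH, "ckpt_*.dat"))
        return max(files, key=os.path.getmtime) if files else None

    while True:
        time.sleep(STAGE_EVERY)
        ckpt = newest_checkpoint()
        if ckpt:
            # copy under a temporary name, then rename, so a partial
            # copy never looks like a valid checkpoint on the server
            dest = os.path.join(OFF_NODE_DIR, os.path.basename(ckpt))
            shutil.copy2(ckpt, dest + ".tmp")
            os.rename(dest + ".tmp", dest)

rsync from cron or a batch-system epilogue would do the same job; the
point is just that the off-node copy is asynchronous to the computation.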
>
> > It all ends up being a risk assessment. We have been up for close
> > to 6 months now. We have not had a failure of the NFS server. The
>
> I have two nothing-installed clusters; one in use for 2+ years,
> the other for about 8 months. the older one has never had an
> NFS-related problem of any kind (it's a dual-xeon with 2 u160
> channels and 3 disks on each; other than scsi, nothing gold-plated.)
> this cluster started out with 48 dual-xeons and a single 48pt
> 100bT switch with a gigabit uplink.
>
> the newer cluster has been noticeably less stable, mainly because
> I've been lazy. in this cluster, there are 3 racks of 32 dual-opterons
> (fc2 x86_64) that netboot from a single head node. each rack has a
> gigabit switch which is 4x LACP'ed to a "top" switch, which has
> one measly gigabit to the head/fileserver. worse yet, the head/FS
> is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.
>
> as far as I can tell, you simply have to think a bit about the
> bandwidths involved. the first cluster has many nodes connected
> via thin pipes, aggregated through a switch to gigabit
> connecting to decent on-server bandwidth.
>
> the second cluster has lots more high-bandwidth nodes, connected
> through 12 incoming gigabits, bottlenecked down to a single
> connection to the head/file server (which is itself poorly configured).
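To put rough numbers on that bottleneck (the per-node gigabit figure is
my assumption; the rack and trunk figures come from the description
above):

    # Back-of-the-envelope oversubscription for the second cluster.
    # Per-node gigabit is assumed; the other figures are from above.
    nodes_per_rack, racks = 32, 3
    node_gbps, trunk_gbps = 1, 4      # 4x LACP trunk per rack
    server_gbps = 1                   # single gigabit into the head/FS

    rack_demand = nodes_per_rack * node_gbps   # 32 Gb/s possible per rack
    into_top    = racks * trunk_gbps           # 12 Gb/s offered to the top switch
    print("rack uplink oversubscription : %d:1" % (rack_demand // trunk_gbps))  # 8:1
    print("server link oversubscription: %d:1" % (into_top // server_gbps))     # 12:1

So even before the slow head/fileserver itself enters into it, the last
hop is oversubscribed about 12:1.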
>
> one obvious fix to the latter is to move some IO load onto
> a second fileserver, which I've done. great increase in stability,
> though enough IO from enough nodes can still cause problems.
> shortly I'll have logins, home directories and work/scratch all on
> separate servers.
>
> for a more scalable system, I would put a small fileserver in each rack,
> but still leave the compute nodes nothing-installed. I know that
> the folks at RQCHP/Sherbrooke have done something like this, very nicely,
> for their serial farm. it does mean you have a potentially significant
> number of other servers to manage, but they can be identically configured.
> heck, they could even net-boot and just grab a copy of the compute-node
> filesystems from a central source. the Sherbrooke solution involves
> smart automation of the per-rack server for staging user files as well
> (they're specifically trying to support parameterized Monte Carlo runs.)
Sandia does something similar to this with their CIT toolkit,
but it is still diskless. For every N nodes, they have an
NFS-redirector. It boots diskless and caches all of the files
that the clients read, so the clients hit the redirector rather
than the main filesystem.
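A toy model of that read-through behaviour (not the CIT code, just a
sketch with hypothetical paths; a real redirector works at the NFS
layer rather than by copying files):

    # Toy read-through cache in the spirit of an NFS-redirector:
    # the first request for a file pulls it from the central
    # filesystem, later requests are served from the local copy.
    import os
    import shutil

    CENTRAL = "/net/central/export"    # the main filesystem
    CACHE   = "/var/cache/redirector"  # redirector-local cache

    def fetch(relpath):
        cached = os.path.join(CACHE, relpath)
        if not os.path.exists(cached):
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            # one read against the central server, ever
            shutil.copy2(os.path.join(CENTRAL, relpath), cached)
        return cached              # every later client read is local

N clients reading the same input then cost the main server one read
instead of N.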
If you do have a disk in these nodes, there are probably some
interesting things you can do with CacheFS when it becomes stable.
Craig