[Beowulf] motherboards for diskless nodes
Craig Tierney
ctierney at HPTI.com
Fri Feb 25 15:02:06 PST 2005
On Fri, 2005-02-25 at 15:18, Mark Hahn wrote:
> > > Reasons to run disks for physics work.
> > > 1. Large tmp files and checkpoints.
> >
> > Good reason, except when a node fails you lose your checkpoints.
>
> you mean s/node/disk/ right? sure, but doing raid1 on a "diskless"
> node is not insane. though frankly, if your disk failure rate is
> that high, I'd probably do something like intermittently store
> checkpoints off-node.
Yes and no. If the node itself is down, it is a bit tough for your model
to progress regardless. RAID1 works well enough in software that the
only additional hardware you need is the extra disk.
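Just to make the off-node idea concrete, here is a minimal sketch
(Python; the paths, file pattern and interval are all made up) of the
sort of thing Mark suggests: write checkpoints to local scratch for
speed, and periodically copy the newest one to an NFS-mounted
directory so a dead node or disk only costs you the last interval.

    # Hedged sketch only: stage the newest local checkpoint off-node
    # at a fixed interval.  Paths and the interval are hypothetical.
    import glob
    import os
    import shutil
    import time

    LOCAL_SCRATCH = "/scratch/ckpt"       # local disk (single or RAID1)
    OFF_NODE_DIR  = "/home/shared/ckpt"   # NFS mount, survives the node
    STAGE_EVERY   = 3600                  # seconds between copies

    def newest_checkpoint():
        files = glob.glob(os.path.join(LOCAL_SCRATCH, "ckpt_*.dat"))
        return max(files, key=os.path.getmtime) if files else None

    while True:
        time.sleep(STAGE_EVERY)
        ckpt = newest_checkpoint()
        if ckpt:
            # copy under a temporary name, then rename, so a partial
            # copy never looks like a valid checkpoint on the server
            dest = os.path.join(OFF_NODE_DIR, os.path.basename(ckpt))
            shutil.copy2(ckpt, dest + ".tmp")
            os.rename(dest + ".tmp", dest)

rsync from cron or a batch-system epilogue would do the same job; the
point is just that the off-node copy is asynchronous to the computation.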
>
> > It all ends up being a risk assessment. We have been up for close
> > to 6 months now. We have not had a failure of the NFS server. The
>
> I have two nothing-installed clusters; one in use for 2+ years,
> the other for about 8 months. the older one has never had an
> NFS-related problem of any kind (it's a dual-xeon with 2 u160
> channels and 3 disks on each; other than scsi, nothing gold-plated.)
> this cluster started out with 48 dual-xeons and a single 48pt
> 100bT switch with a gigabit uplink.
>
> the newer cluster has been noticeably less stable, mainly because
> I've been lazy. in this cluster, there are 3 racks of 32 dual-opterons
> (fc2 x86_64) that netboot from a single head node. each rack has a
> gigabit switch which is 4x LACP'ed to a "top" switch, which has
> one measly gigabit to the head/fileserver. worse yet, the head/FS
> is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.
>
> as far as I can tell, you simply have to think a bit about the
> bandwidths involved. the first cluster has many nodes connected
> via thin pipes, aggregated through a switch to gigabit
> connecting to decent on-server bandwidth.
>
> the second cluster has lots more high-bandwidth nodes, connected
> through 12 incoming gigabits, bottlenecked down to a single
> connection to the head/file server (which is itself poorly configured).
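To put rough numbers on that bottleneck (the per-node gigabit figure is
my assumption; the rack and trunk figures come from the description
above):

    # Back-of-the-envelope oversubscription for the second cluster.
    # Per-node gigabit is assumed; the other figures are from above.
    nodes_per_rack, racks = 32, 3
    node_gbps, trunk_gbps = 1, 4      # 4x LACP trunk per rack
    server_gbps = 1                   # single gigabit into the head/FS

    rack_demand = nodes_per_rack * node_gbps   # 32 Gb/s possible per rack
    into_top    = racks * trunk_gbps           # 12 Gb/s offered to the top switch
    print("rack uplink oversubscription : %d:1" % (rack_demand // trunk_gbps))  # 8:1
    print("server link oversubscription: %d:1" % (into_top // server_gbps))     # 12:1

So even before the slow head/fileserver itself enters into it, the last
hop is oversubscribed about 12:1.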
>
> one obvious fix to the latter is to move some IO load onto
> a second fileserver, which I've done. great increase in stability,
> though enough IO from enough nodes can still cause problems.
> shortly I'll have logins, home directories and work/scratch all on
> separate servers.
>
> for a more scalable system, I would put a small fileserver in each rack,
> but still leave the compute nodes nothing-installed. I know that
> the folks at RQCHP/Sherbrooke have done something like this, very nicely,
> for their serial farm. it does mean you have a potentially significant
> number of other servers to manage, but they can be identically configured.
> heck, they could even net-boot and just grab a copy of the compute-node
> filesystems from a central source. the Sherbrooke solution involves
> smart automation of the per-rack server for staging user files as well
> (they're specifically trying to support parameterized Monte Carlo runs.)
Sandia does something similar to this with their CIT toolkit,
but it is still diskless. For every N nodes, they have an
NFS-redirector. It boots diskless and caches all of the files
that the clients read, so the clients hit the redirector rather
than the main filesystem.
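A toy model of that read-through behaviour (not the CIT code, just a
sketch with hypothetical paths; a real redirector works at the NFS
layer rather than by copying files):

    # Toy read-through cache in the spirit of an NFS-redirector:
    # the first request for a file pulls it from the central
    # filesystem, later requests are served from the local copy.
    import os
    import shutil

    CENTRAL = "/net/central/export"    # the main filesystem
    CACHE   = "/var/cache/redirector"  # redirector-local cache

    def fetch(relpath):
        cached = os.path.join(CACHE, relpath)
        if not os.path.exists(cached):
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            # one read against the central server, ever
            shutil.copy2(os.path.join(CENTRAL, relpath), cached)
        return cached              # every later client read is local

N clients reading the same input then cost the main server one read
instead of N.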
If you do have a disk in these nodes, there are probably some
interesting things you can do with CacheFS when it becomes stable.
Craig