[Beowulf] Building a new cluster - seeking some advice

Fri Dec 21 16:12:14 PST 2007

Thomas Carroll wrote:
> 1. I'd like to go diskless.  I've never done this before (the other two
> clusters are...diskful?).  I've used Fedora on both of the previous
> clusters.  Is this a good choice for diskless?  Any advice on where to
> start with diskless or operating system choices?

I know RHEL has support for diskless; I've talked to people who have used it.
In general, if you are familiar with PXE boot, DHCP, initrd, ram disks, and
related tools, it's relatively straightforward.  If that kind of stuff scares
you, I'd consider spending $40 per node on a cheap disk.  I've not tried
this recently, but at the time I did see less stability without local
swap.  Swap over the network can be tricky: a network transfer sometimes
involves allocating a page, and if you are swapping you might not have
one free.  I've seen a project or two for network block layers that handle
this; no idea if any of them are current.
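
To give a feel for the PXE side, here is a rough Python sketch of the two
pieces a boot server usually needs per node: the config file name PXELINUX
looks for (one of the names it tries is "01-" plus the MAC address in
lowercase hex with dashes) and a config that boots a kernel with an NFS
root.  The function names, the server address, the export path, and the
kernel/initrd file names are all made up for illustration; adjust them to
whatever your distribution actually ships.

```python
HEX_DIGITS = set("0123456789abcdef")


def pxelinux_config_name(mac: str) -> str:
    """Return the per-host file name PXELINUX tries under pxelinux.cfg/.

    The "01" prefix is the ARP hardware type for Ethernet.
    """
    octets = mac.replace("-", ":").lower().split(":")
    valid = len(octets) == 6 and all(
        len(o) == 2 and set(o) <= HEX_DIGITS for o in octets
    )
    if not valid:
        raise ValueError(f"not a MAC address: {mac!r}")
    return "01-" + "-".join(octets)


def pxelinux_config(nfs_server: str, root_path: str) -> str:
    """Build a minimal PXELINUX config that mounts / over NFS, read-only."""
    # root=/dev/nfs, nfsroot= and ip=dhcp are standard kernel parameters;
    # vmlinuz and initrd.img are placeholder file names (assumptions).
    append = (
        f"initrd=initrd.img root=/dev/nfs "
        f"nfsroot={nfs_server}:{root_path} ro ip=dhcp"
    )
    lines = [
        "DEFAULT diskless",
        "LABEL diskless",
        "  KERNEL vmlinuz",
        f"  APPEND {append}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical MAC, server address, and export path.
    print(pxelinux_config_name("00:1A:2B:3C:4D:5E"))
    print(pxelinux_config("192.168.1.1", "/export/root"))
```

If every compute node boots the same image, a single pxelinux.cfg/default
file is enough and the per-MAC name is only needed for exceptions.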

No idea on diskless for Fedora; I suspect someone will comment.

In any case, my strategy was a read-only shared /, then a per-machine
read/write /var.  So the head node had a /var/host1, /var/host2, ....  That
way things like ntp.drift, ssh session keys, and the numerous temporary
files in /tmp and /var/tmp wouldn't conflict across the cluster.  I ended up
using this for lab machines.  It was quite amusing to watch a hacker try to
hack binaries on an NFS client that the NFS server considered read-only.
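
As a sketch of that layout, the following Python generates /etc/exports
lines for one shared read-only root plus one read/write /var per host.
The export paths (/export/root, /export/var) and the host names are
assumptions for illustration, not what I actually used; the exports
syntax itself (path, then client(options)) is the standard one.

```python
def exports_lines(hosts, root="/export/root", var_base="/export/var"):
    """Return /etc/exports lines: one shared ro root, one rw /var per host.

    root and var_base are hypothetical paths on the head node.
    """
    # Every compute node mounts the same root image, read-only.
    root_clients = " ".join(f"{h}(ro,no_root_squash)" for h in hosts)
    lines = [f"{root} {root_clients}"]
    # Each node gets a private writable tree, exported only to that node.
    for host in hosts:
        lines.append(f"{var_base}/{host} {host}(rw,no_root_squash)")
    return lines


if __name__ == "__main__":
    for line in exports_lines(["host1", "host2", "host3"]):
        print(line)
```

Exporting each /var only to its own host keeps one node from scribbling
on another node's state.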

My overhead per client was just a few tens of MB on the head node's disk.

So the head node basically had 2 installs, one for the head node and
one for all the compute nodes, as well as 2 RPM databases.  I didn't use
or maintain the RPM database on the client nodes since they didn't really
have their own filesystem.  Thankfully /dev is no longer on the local disk
so that is not a problem.

You will likely need more than the default 8 NFS daemons on the head node.
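
One rough way to check whether the thread count is too low is to look at
the "th" line in /proc/net/rpc/nfsd on the head node.  The sketch below
assumes the format where the first number is the thread count and the
second is how many times all threads were busy at once; newer kernels may
report that second field differently or as zero, so treat it as a hint
only.  The function names are mine.

```python
def parse_nfsd_threads(stats_text):
    """Return (thread_count, times_all_threads_busy) from nfsd stats text.

    Assumes a line of the form: th <threads> <fullcnt> <histogram...>
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "th":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'th' line found in nfsd statistics")


def needs_more_threads(stats_text):
    """True if every nfsd thread has been busy at the same time at least once."""
    _threads, all_busy_count = parse_nfsd_threads(stats_text)
    return all_busy_count > 0


if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfsd") as stats_file:
            threads, all_busy = parse_nfsd_threads(stats_file.read())
        print(f"{threads} nfsd threads, all busy {all_busy} times")
    except (OSError, ValueError) as err:
        print(f"could not read nfsd statistics: {err}")
```

If the second number keeps climbing under load, raise the count.  The
thread count is the argument to rpc.nfsd, and on Red Hat style systems of
this era it is normally set through RPCNFSDCOUNT in /etc/sysconfig/nfs.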

> 2. Given my budget (about 20K), I plan on going with GigE on about 24
> nodes.  Am I right in thinking that faster network interconnects are
> just too expensive for this budget?

I suspect so.  I'll be interested to hear if others suggest something
where the switch, cables, and NICs don't eat most of the budget.  I
guess the cheapest IB cards + switch might be low enough to let you
still buy more nodes than GigE would scale to for a certain code.  Any
idea if your code is communication intensive enough that 12 IB nodes with
quad core CPUs might be faster than 24 dual core nodes with GigE?

> 3. I'll be spending most of my cluster's time diagonalizing large
> matrices.  I plan on using ScaLAPACK eventually; currently I just use
> LAPACK/ATLAS and do individual matrices on each node.  The only thing
> parallel about my code right now is using the nodes for monte carlo.
> This is what I'm looking at right now for my compute nodes:
> 	* Intel Core 2 Duo E6850 Conroe 3.0GHz ($280)

Hmm, a Q6600 quad 2.4 GHz is the same price; at least for some codes I'd
expect it to have more throughput than a dual core 3.0 GHz.  Of course
if the network is the bottleneck it won't help.

> 	* 8 GB (4 X 2 GB) DDR2 800 (~$200)

To keep the same memory per core with the Q6600 you'd have to double that.
Do you need 4GB per core?  I suspect cheap motherboards will not allow 8x2GB
(which is what the same memory per core requires with a quad core).
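
The arithmetic behind that, as a small Python sketch (function names are
just for illustration): 4 x 2GB on a dual core is 4GB per core, and
holding 4GB per core on a quad core means 16GB, or 8 DIMMs of 2GB.

```python
import math


def memory_per_core_gb(dimm_count, dimm_gb, cores):
    """Total installed memory divided evenly across the cores."""
    return dimm_count * dimm_gb / cores


def dimms_needed(gb_per_core, cores, dimm_gb):
    """Number of DIMMs of size dimm_gb needed to reach gb_per_core."""
    return math.ceil(gb_per_core * cores / dimm_gb)


if __name__ == "__main__":
    per_core = memory_per_core_gb(dimm_count=4, dimm_gb=2, cores=2)
    print(f"dual core, 4 x 2GB: {per_core} GB per core")
    print(f"quad core at the same ratio: {dimms_needed(per_core, 4, 2)} x 2GB")
```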

> 	* Case/PSU combo ($60)
> 	* ATX motherboard w/GigE (~$100)
> This comes out to about 17k for 24 nodes + spare parts + hard drives for
> the head node.  I've already purchased the switch and cables and have
> more than adequate cooling and shelving for the room.
> 
>   The motherboard does NOT have integrated video.  Will I need video
> output?  Can you even build a node without it?  Problem is, the

To get a single node running diskless, most likely.  For the next 3 years
after it runs, probably not.  I'd get two video cards for debugging; it is
kinda nice to have a console when the kernel oopses, but there is a network
console facility in the kernel (netconsole) that can usually handle sending
an oops remotely (output that would normally only go to the console's
screen).
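
The kernel facility normally used to send console output such as an oops
over the network is netconsole, which emits the messages as UDP packets
(target port 6666 by default).  On the receiving side something as small
as the Python sketch below is enough to catch them on the head node.  The
function names and the addresses in the comments are hypothetical, and I
haven't run this against a real crash.

```python
import socket

NETCONSOLE_PORT = 6666  # netconsole's default target port


def open_listener(bind_addr="0.0.0.0", port=NETCONSOLE_PORT):
    """Open a UDP socket that netconsole senders can target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_message(sock, timeout=None):
    """Wait for one datagram and return (sender_ip, text)."""
    sock.settimeout(timeout)
    data, (sender_ip, _sender_port) = sock.recvfrom(65535)
    return sender_ip, data.decode("utf-8", errors="replace")


def log_forever(sock):
    """Print every message with the sending node's address in front."""
    while True:
        sender_ip, text = receive_message(sock)
        print(f"[{sender_ip}] {text}", end="")


# On a compute node, a kernel parameter along the lines of
#   netconsole=@/eth0,6666@192.168.1.1/
# (interface and head node address are assumptions) points the
# console output at this listener.
```

Calling log_forever(open_listener()) on the head node then shows console
output from every node that has netconsole pointed at it.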

> motherboards with adequate support for 8GB memory and 1333 FSB don't

I'm all for the faster FSB, but you might test to see if the performance
improves.  From what I can see, for the same $ the 1333 FSB often has the
same latency and only somewhat higher performance.  If it adds much
cost or reduces flexibility I'd at least look at the 1066 FSB motherboards.
Alas, I think the date when DDR2-533 was cheaper than DDR2-667 has passed.
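
For a sense of scale, here is the peak-bandwidth arithmetic as a Python
sketch.  It assumes the usual 64-bit (8 byte) wide front side bus and
64-bit DDR2 channels, and these are theoretical peaks only; they say
nothing about latency, and real codes will see less.

```python
BUS_WIDTH_BYTES = 8  # 64-bit FSB and 64-bit DDR2 channel (assumed)


def fsb_peak_gb_s(mega_transfers_per_s):
    """Theoretical peak FSB bandwidth in GB/s (1 GB = 1000 MB)."""
    return mega_transfers_per_s * BUS_WIDTH_BYTES / 1000


def ddr2_peak_gb_s(mega_transfers_per_s, channels=2):
    """Theoretical peak DDR2 bandwidth in GB/s across all channels."""
    return mega_transfers_per_s * BUS_WIDTH_BYTES * channels / 1000


if __name__ == "__main__":
    for fsb in (1066, 1333):
        print(f"FSB {fsb}: {fsb_peak_gb_s(fsb):.2f} GB/s peak")
    print(f"dual channel DDR2-800: {ddr2_peak_gb_s(800):.2f} GB/s peak")
    print(f"1333 vs 1066: {1333 / 1066:.2f}x")
```

So on paper the 1333 FSB buys about 25% more peak bandwidth than 1066,
which is why it is worth measuring with your own code before paying for it.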

> have video.  I could spend $10-20 per node for a video card, but that
> seems like a waste.  From reading around, it seems like there is no
> advantage really to DDR3 memory...is that right?  Any advice on the

It's what I've read as well; I've not tried it myself yet.

> video issue or my potential parts list would be greatly appreciated.

Try to get motherboards without fans.

> Thanks so much for any advice.  Feel free to offer unsolicited advice as
> well :).  And I hope everyone has, or has already had, a good holiday!

Seems plausible.  Keep in mind that even if your time is cheap/free, there
are usually plenty of other things to do.  If disk drives get your cluster
into production a week sooner, is that worth $40 a node?  Is local swap of any
value? How about local disk I/O for anything disk I/O intensive?  I've done
diskless myself and not regretted it.  Yet clusters I build these days
(usually from a vendor not from a stack of parts) have disks.