[Beowulf] Setting up a new Beowulf cluster
Robert G. Brown
rgb at phy.duke.edu
Wed Feb 13 07:45:59 PST 2008
On Fri, 8 Feb 2008, Berkley Starks wrote:
> Thank you all so much for the advice so far. This has helped me see a few
> more of the things that I did not realize at first.
> For a little info on the project, I developed this project as a tool to work
> on my Senior Thesis in a year or so. Doing computational nuclear physics
> requires such resources. It will also be used heavily for Monte Carlo
> Simulations and just about any other form of computational physics. The two
> named are definite projects that are already in the lineup for when I do
> get the cluster up and functional.
(Sorry about the delay, I'm busy busy busy:-)
OK, this (and the stuff below) makes your job relatively easy. I'm
going to guess that your application mix will almost certainly be
"embarrassingly parallel" at least at first -- lots of compute nodes
running MC simulations in nuclear physics (a situation we also have here
at Duke) plus people running random applications of one sort or another
in a sort of "compute farm" way. After you've had it for a few years,
you'll probably start to develop at least a few "real parallel"
applications, so we'll use a design that can segue into that, but to do
that "right" you'll have to deliberately engineer the cluster to fit the
task and will need an actual budget.
You'll need an actual budget to get started here, too, especially if you
want to build a cluster that is actually "useful". Here's the math.
According to Moore's Law (a scaling law for computing performance at
constant cost that has held at least approximately for some four
decades) compute power at constant cost has doubled roughly every 18
months. That means that four year old machines, by the time you get
them, will be roughly 2^(48/18) ≈ 6x slower (call it 6-8x) than a brand
new machine that
costs just as much as they cost when they were new. Since machines --
amazingly powerful machines, like dual processor dual core 64 bit CPU
machines -- can be purchased for (say) $2000 give or take a bit
depending on what precisely you get on them and might be MORE than 8x
faster than an old 32-bit P6 machine, you're going to have the paradox
that some faculty desktops will be faster than your entire cluster.
To put it another way, while using old machines is fine for making a
learning cluster, it's going to suck in production, with a lot of work
and investment required to get to where you could go far more easily by
buying a single new desktop at modest cost.
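That scaling arithmetic is easy to sketch. Here's a minimal Python back-of-envelope, assuming the 18-month doubling period quoted above and 48-month-old surplus machines (both figures are the assumptions from the discussion, not gospel):

```python
# Back-of-envelope Moore's Law slowdown for surplus hardware.
# Assumptions: performance/dollar doubles every 18 months; the
# surplus machines are 48 months old when you receive them.
def slowdown_factor(age_months, doubling_months=18):
    """How many times slower old hardware is than new at equal cost."""
    return 2 ** (age_months / doubling_months)

print(round(slowdown_factor(48), 1))  # -> 6.3, i.e. roughly 6-8x slower
```

Plug in your own doubling period if you think 18 months is too optimistic or pessimistic; the conclusion (old "free" nodes lose badly to one new box) is insensitive to the exact figure.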
The design I'm going to suggest for you (Geeze, I feel like Clinton on
What Not to Wear) is a tasteful cluster, one that is initially
surprisingly affordable, gives you the opportunity to learn about
clustering, and provides your nuclear group with "a" place to run jobs,
yet it can grow and change as your needs (and budget!) grow and change.
Let's budget it out.
Your cluster will need a home, and there are good homes and not so good
homes depending on its scale. Close to networking is good. In a rack
is great, although you can certainly get started on heavy duty steel
shelving. On a floor that is rated to support the weight of your
growing stack of hardware is key -- a fully loaded rack can be quite
heavy, and nothing ruins your day like having a rackful of expensive
hardware crash through the floor to land on the head of somebody one
floor down (or worse, break all the way through in a cascade effect down
to the basement). Hard on head, and likely to break all that expensive
equipment. Oh, and the building. Did I mention the lawsuits?
The three critical components required by your cluster in its physical
home are power and cooling and a network or network access. A "box" --
a cluster node containing one or more processors -- typically draws
between 100W minimum to around 250W, depending on how many processors it
has, how much memory, whether or not it has disk(s) or other
peripherals. This is a rule of thumb, YMMV. At one point I would have
estimated 100W per CPU, but nowadays I think it is probably down to more
like 50-60W per CPU core (anybody have current numbers on actual
hardware to contribute?). If we assume that you'll get started with a
humble 8 contributed ancient P4's at 125W each, that's a kilowatt right
there. Add networking, add disks, add monitor(s), add a separate
server, and you're right up at the limit of a standard 20 ampere
circuit. This means that you will need AT LEAST one dedicated 20 Amp
120VAC circuit to run your cluster, and will need ADDITIONAL circuits as
your cluster grows. They don't all have to be handy when you get
started, but if you try to put the cluster onto an existing, already
half-loaded circuit it's going to trip breakers when you first power it
on and that's embarrassing so think ahead.
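The circuit-loading estimate above works out like this. A sketch with assumed numbers -- 125W per P4 node as in the text, plus a guessed 300W of overhead for switch, monitor and server:

```python
# Circuit loading for the hypothetical starter cluster (all figures
# are assumptions from the discussion above, YMMV).
nodes = 8
watts_per_node = 125       # ancient P4 box, rule-of-thumb draw
overhead_watts = 300       # switch + monitor + separate server (guess)

total_watts = nodes * watts_per_node + overhead_watts  # 1300 W
amps = total_watts / 120.0                             # on 120 VAC

print(f"{total_watts} W is {amps:.1f} A")
# A 20 A breaker should only carry ~16 A (80%) continuous, so at
# ~10.8 A this cluster already wants a dedicated circuit, with little
# headroom left for growth on a half-loaded shared one.
```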
As a physics student, you will recall thermodynamics. All the power
consumed by cluster nodes appears shortly thereafter in their immediate
environment as heat. If you remove the heat as fast as it is generated,
the environment (and nodes) remain at a constant temperature. If not,
it gets hotter until thermal diffusion through e.g. the walls of the
space balance it out. Computers HATE to be hot. They express their
irritation by breaking, burning out early, actually malfunctioning and
throwing bit errors that ruin a computation. We want our cluster to be
cool and happy and last a long time and run reliably, so we want our
cluster space to be anywhere from cool to COLD. The rule is that
computer components lose a year of expected lifetime for every 10
degrees Fahrenheit above an ambient air temperature of 68F (20C), which
is a
"cool" temperature for an office. 60F is better still -- most server
rooms are maintained with ambient air temperature as cool as 50F (10C),
more likely ballpark 60F under load. Air conditioning capacity is
measured in "tons", where a ton of AC is a unit capable of removing,
over 24 hours, the heat required to melt a ton-sized block of ice at
32F (latent heat of fusion, work it out), which just happens to be
~3500 watts. You want to be able to stay AHEAD of the heat and actually
cool heat infiltration from outside, so you'll need more (25-35% more)
AC capacity in watts than you have power capacity in watts. You also
need to worry about air circulation, especially if you're building the
cluster in e.g. a closet (NOT recommended). A big open room gets a bit
of convective help and is better than a small closed space. The air
should be and remain dry.
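Sizing the AC from those numbers is simple arithmetic; here's a sketch in Python. The 30% headroom is taken from the 25-35% range above, and ~3517 W per ton is the standard conversion (12,000 BTU/hr):

```python
# AC sizing rule of thumb: one "ton" of cooling removes ~3517 W
# (12,000 BTU/hr); add 25-35% headroom over the electrical load.
TON_WATTS = 3517

def tons_needed(load_watts, headroom=0.30):
    """Tons of AC capacity for a given heat load, with headroom."""
    return load_watts * (1 + headroom) / TON_WATTS

print(round(tons_needed(1300), 2))   # the starter cluster: ~half a ton
print(round(tons_needed(8000), 2))   # an ~8 kW rack: about 3 tons
```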
Then the space needs networking. There are two aspects of this to
consider, and they're not separate for the cluster design I'm going to
suggest. One is the network required by the actual cluster nodes, which
communicate with each other (if needed) and the "master" node and other
workstations in the department (certainly) via network interconnects.
The other is the network connection to the rest of the department -- how
are people going to use the cluster? It is by far the easiest if they
can just start jobs up from their desktops, which means everything needs
to be on the same network. Minimally, then, the cluster space needs
>>a<< network wire running into it from the building networking closet
and connected to its presumed switch. Beyond that, there are several
ways to proceed, depending on local politics, who provides what, who
"owns and runs" what, and practical considerations.
For example, one scenario is that you upgrade the existing building
networking closet by adding a 48 port professional-grade gigabit
ethernet switch that is uplinked into the existing possibly slower
department switch. A nice fat bundle of cable is run from the
punchblocks in this closet back to your cluster space, and punched into
a panel of RJ45 ports in a rack in your cluster space. As you add
nodes, you simply cable them into this rack and add cables in the wiring
closet from the punch port back to the switch. This has many advantages
-- one being that you can hook (selected) faculty or office DESKTOPS
into the gigabit switch so they are on the same flat network,
effectively INCLUDING THEM IN THE CLUSTER. Since some of your faculty
-- the ones doing the MC computations, for example -- will have power
desktops that might equal or exceed the power of your initial cluster,
this gives EVERYBODY potential access to all of that power if you
establish a resource-sharing policy and can make your initial cluster
3-4x as powerful as it might otherwise be quite easily, especially if
you have spare cycles you can salvage on e.g. student clusters that are
idle all night.
Another scenario is that you get a single smaller gigabit switch for
your cluster, mount it in the rack or on the shelf of that cluster, and
have a single gigabit link back to the department network. This gives
you a bottleneck between the faculty desktops and the cluster, but for
embarrassingly parallel code it won't matter. I'm guessing this is the
way you will go initially, and you can always change over later, but
SOMETIMES if you dicker things like the former out now, you can get
other people to pay for them and end up with something really nice and
scalable for the future, or at least grease the way for later when you
need to go back and say you've outgrown the first effort and need to
reconsider. IF you ever get to where a higher end network is necessary
-- a "real" dedicated cluster network -- you'll probably need to use the
local switch architecture anyway, although you might well have both that
network and whatever switched gigE/TCP/IP network you started with at
the same time.
Anyway, enough on infrastructure. Let's talk about the cluster and what
you'll need to acquire or budget. I truly think that you're going to
need a budget of a few thousand dollars even to get started, although if
you can't get even that little an amount, well, we'll do what we can.
Cheapest Possible 2-8 Box Learning Cluster
Ingredients: Heavy duty steel shelving ($50 at Home Depot). An 8-10
port gigabit switch ($50 from numerous makers and vendors). Sixteen 6'
to 14'
patch cables, cable ties, 2 surge protector power strips, small work
table/bench, work chair -- scrounged if possible, $250 would buy it all
and a nice little pocket toolkit as well.
You will need a monitor and keyboard (and possibly a mouse) on the
workbench and connectable to the backs of each node on demand.
Scrounging is OK, you can get a nice flat panel that draws less power
(and makes less heat) and is a lot easier to move around for around
$200-250, a whole KVM setup for easily less than $300. You may want to
consider getting a small KVM switch to make it "easy" to switch between
consoles on nodes but this is a luxury item and really belongs in the
next description instead.
For nodes you take what you can scrounge and augment them by buying what
you can afford. You should be prepared to repair nodes, buy gigabit
ethernet cards for nodes, and add memory or a disk to nodes, at cost or
from a "boneyard" of scavenged parts from systems that are DOA but have
usable memory chips or CPUs or power supplies that still work. Still,
I'm guessing you'll need a few hundred dollars absolute minimum in a
budget to get started. Your "free" nodes will only rarely turn out to
really be free; more often you'll have to drop maybe $50 into them to
add memory and networking (again, this cost and the differential cost of
power alone favors BUYING brand new nodes over fixing up old nodes --
THERE IS NO PRICE-PERFORMANCE WIN in going cheap, for all that it is
very informative and a great learning experience).
One node you will almost certainly want to buy, or build out of the best
of what you can scrounge. This is your cluster's "head", or "server"
node. I'm going to suggest a flat cluster design, so the latter is a
more reasonable description. This is a machine you fix up or purchase
with:
* lots of memory, 1-2 GB if possible.
* multiple CPUs or CPU cores. 2-4 if possible.
* a "good" e.g. Intel gigabit ethernet interface, or even two.
* 3-4 largish disks, configured in an md raid level 5.
* a "good" graphics adapter -- one capable of running a graphical
display efficiently and at a decent resolution (which should of course
match up decently with the capability of your monitor, which I suggest
be capable of at least 1280x1024 and at least 17" diagonal).
This machine is the one that you set up with a full linux desktop and an
NFS exportable filesystem for /home and/or workspace on all of the
nodes. It MAY end up being a DHCP/PXE server (which may require that it
be on a private network in order not to fight with departmental servers
which in turn may require that it have that second ethernet interface),
a web server (to facilitate HTTP-driven PXE installs), a diskless node
server (if you go with a diskless node design to save money and power at
the expense of a somewhat steeper initial learning curve). In
master-slave computations it will likely be the master. In computations
run in "batches" it will be the place those jobs are submitted, and the
place users will visit to retrieve results. It will (usually) be the
node you "name" for the cluster, where the nodes have abbreviated
hostnames like b01, b02, b03...
I would budget a MINIMUM of $1500 for this node, purchased new, $2000
would be better. If you rebuild out of parts, you'll need to scrounge
an old system with a big enough tower to be able to hold 3-4 disks
(usually a mid-size tower will be a bit tight) with as fast a CPU as you
can manage and as much memory as you can afford to add and with 1-2 gigE
interfaces. I am not including backup devices in this cluster design --
backup belongs in the "better" design below.
This gives you (tallying things up) the need for at absolute minimum a
budget of $1000-$1500, which presumes that you scrounge nearly
everything but still need to buy disks, memory, spare parts, network
switch, with a bit leftover to handle server crashes and make life
comfortable. You'd do far better with a budget of $3500, buying
yourself a nice server/head node, setting up a nice working environment
and a much larger network switch from the beginning and still having
$1000 or so to fix up scrounged nodes.
NOTE WELL! As noted above, if your cluster is "flat" with the
department (linux) network, you can easily enough make your scheduler
distribute jobs out to individual (linux) desktops and include them "in"
your cluster using e.g. Condor as a resource manager. In fact, you can
make a "cluster" out of your existing linux LAN at no investment but
time and software configuration IF your department policy and so on
permit it. It often depends on who "owns" those desktops and what they
get out of it -- linux is perfectly capable of running a desktop
interactive session with somebody AND a background numerical task with
essentially no impact of the latter on the former -- desktop computing
rarely uses as much as 1% of a system's total compute capacity.
On to what I think of as a "better idea"
Inexpensive Starter Cluster with a Future
The good thing about the cluster above is that it is cheap. Oh, you can
go even cheaper. Take two systems, slap linux on them, pop them on any
old network and it is "a cluster" in that you can run computations on
both at the same time and add more nodes when you find them. Or just
look at your department linux lan, enable logins for all users on all
desktops, establish a policy for use or install a policy tool like
condor and go "poof! you're a cluster!". That's a description of my
own home cluster -- a flat switched network with lots of linux boxes
that are "a cluster" when I want them to be and desktops the rest of the
time, where I don't even bother with Condor (ownership being clear).
However, the BAD thing about it is that cheap as it is, anything built
with 4 year old hardware is a loser right out of the box. Seriously.
The differential cost of POWER ALONE over a single year will generally
buy you a single modern system that is as fast or faster than the entire
cluster. That's the bitchin' thing about Moore's Law -- there is no
sane afterlife for systems because it gets to where the cost of
operation alone exceeds the cost of replacement, and then we can do all
sorts of TCO computations and assessments of the cost of maintenance and
conclude that it is really really dumb to do this UNLESS other people
will pay for power no matter how much you use but not give you the money
to buy nodes. Which happens so often that it isn't funny, but it is
stupid nevertheless. Or for student/learning clusters, where you do
what you can and have NO budget but what you can raise at a bake sale.
I advise 2-3 students a year in that category, so I'm pretty sympathetic
to it, but I advise them to come up with a few thou a year budget
anyway.
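To make the power argument concrete, here's a hedged sketch using the $1 per watt per year (power plus AC) estimate that appears later in this post; the wattages are assumptions, not measurements:

```python
# Differential power cost of running old "free" nodes vs. one modern
# box. All wattages are assumptions; $1/W/year covers power + AC.
old_nodes, old_watts = 8, 125     # the surplus P4 fleet
new_watts = 250                   # one modern multi-core node (assumed)
dollars_per_watt_year = 1.0

differential = (old_nodes * old_watts - new_watts) * dollars_per_watt_year
print(differential)  # -> 750.0 saved per year by retiring the old fleet
```

That annual saving is most of the way to the price of a replacement node that outruns the whole old fleet, which is the point.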
So here's the "better" design. It costs more initially, but it will scale
nicely out to racks and racks of systems, and the systems you get will
always be boxes that your nuclear faculty will drool over and WANT to
run their jobs on -- so much so that initially they'll fight to get
time, and be properly motivated to write grants with an equipment budget
that contributes a few nodes a year or more to your collection.
Start by buying a nice, 43U, four post, open equipment rack. IIRC you
can get one for around $400 that will work just fine (don't get $1000+
ones with glass doors and whatever -- you're not made of money after
all). Get a nice 48 port professional-grade rackmount gigabit ethernet
switch for maybe $800. Get a few packets of ethernet cables, different
colors, in lengths from 6' to 14', velcro cable ties, maybe a rackmount
power distribution system (not necessarily a "UPS", mind you), cable
holders -- enough stuff to outfit your rack so that it can be kept
"pretty" -- and easy to maintain. This might cost another $500.
Into this put a nice rackmount raid system "head node" with maybe a TB
of storage capacity and a BACKUP system -- a tape library. Initially
you can "get by" with what amounts to an enhanced node with four disks
and no backup, but you'll have to warn your users that there is no
backup and that they are responsible for securing, copying, mirroring
their own valuable data elsewhere. Backup is expensive (which sucks)
but for a professional operation it is obviously essential. I'd budget
a MINIMUM of $2500 for the disk server alone, $4000-5000 for disk server
plus backup. These numbers are starting to get really soggy -- you'd
best get real quotes for exactly what you want to START with, then go
find the money and not the other way around lest you end up short!
Nodes are then added to the limit of your budget, ideally in a standard
form. These days I'd recommend dual processor quad core nodes for
CPU-bound Monte Carlo computations, dual-duals for codes that do a lot
of vector algebra, and possibly plain old dual processor nodes if you
have jobs that are REALLY memory bound to where even dual cores start to
collide (YMMV very much here, be warned). Dual-quads will get you
optimum raw compute capacity per dollar, though, I think, and sound
ideal for your expected initial task mix. Outfit the nodes with at
LEAST 1 GB per core, 2 GB is better. Any nodes you buy in this way will
have 2 gigE interfaces integrated on the motherboard, which is fine.
Try to get 3 to 4 year onsite service contracts on all "critical"
electronic hardware you buy, from the switch on down. As noted above 3
years = "infinity", at the end of this 3 year warranty you'll need to be
looking for replacement hardware in any event, as the cost of powering
any 8 nodes for a year will get really close to the cost of BUYING a
single node that will do the work of the 8 at the power cost of only
one.
Node prices, including warranty, will then range from as low as a bit
under $2000 to $4000 depending on memory, number of cores and so on.
Avoid bleeding edge processor clocks for YOUR starter cluster -- look
for the sweet spot in CPU clock (aggregate cycles) per dollar spent,
usually the second or third cheapest available CPU in any given
configuration (bearing in mind the TOTAL SYSTEM price, not just raw CPU
price in your cost-benefit estimates).
Going this route, $1000 for rack plus accessories, $2500 for a head
node, $2000 for a single worker node, $500 for error in my seat of the
pants estimates and miscellaneous stuff -- you can "get started" with at
least 4, maybe 8 >>modern<< (64 bit, uberfast) CPU cores for around
$5000, get started with backup for around $7000, get started nicely with
as many as 24 CPU cores for maybe $12,000. Which is still, believe it
or not, chickenfeed in the research business.
This design scales beautifully. Go to your nuclear groups, pass the
hat. Offer them free room in the rack, access to server and switch and
backup (all paid for by the department, the university, a startup grant,
whatever) if they pony up $2000-4000 for N-core nodes that are selected
from the following list, with mandatory onsite service contracts.
They'll jump at the chance -- they'd have to spend twice as much to get
the same capacity as THEY'D have to provide access to a server, AC,
power, infrastructure, management. Point out that with lots of
participants, they can share resources -- everybody individually will
have down time when they're writing papers, are out of town, on vacation
-- and they can trade access to their nodes when they're not using them
to others in return for the same favor another day. So if they buy a
node with 8 cores in the rack and so do three other groups, there might
come a day when they can use all 32 processors in a pinch to finish off
a paper before a deadline.
It is also easy to write proposals for. Any of your groups can write or
add to a proposal a budget for N nodes that fit in the existing rack.
University cost-sharing is manifest, resources are well-leveraged,
funding is likely. With a full-height rack, you can add as many as 40
1U nodes to 3-5U of switches and servers, on a floor that can hold 1 ton
per square meter, in a space that can provide 8-10 kW and 4 tons of AC
(per filled rack).
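Those per-rack numbers hang together; a quick sanity check (the 250 W per 1U node figure is an assumed modern-node draw, per the rule of thumb earlier in this post):

```python
# Sanity check of the filled-rack estimates: 40 1U nodes at ~250 W.
nodes, watts_per_node = 40, 250
load_w = nodes * watts_per_node        # 10,000 W -- the "8-10 kW" figure

TON_WATTS = 3517                       # watts removed per ton of AC
tons = load_w * 1.30 / TON_WATTS       # with 30% headroom

print(load_w, round(tons, 1))          # -> 10000 3.7 (call it 4 tons)
```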
THIS sort of design can scale right on out of your department.
Chemistry may want to play. So might engineering. Even economics does
large scale computations nowadays. You might find yourself setting up
and filling a cluster room with multiple racks, wall-sized Liebert ACs,
and so on.
Or anyway, you can at least dream...;-)
Obviously I favor this approach if you can finagle the minimum $5K
buy-in, STRONGLY favor it if you can scare up $7K or more. I also tend
to recommend that you look at e.g. www.penguincomputing.com for
possible nodes, because they are linux-passionate and their AMD opteron
nodes are excellent performers and simply (to my own experience) do not
break. They'll likely cut you a break of a few percent on a collective
"getting-started" price as well.
When trying to "sell" this approach, point out to the powers that be
that $5K,
$10K, $15K is not the real cost of the cluster. The $1 per watt per
year for power and AC (estimated) is not the real cost either. The real
cost is the human time required to design it, set it up, and manage it.
That cost is $50K and up per year! If you're doing this as a project
"for free", they are already getting tens of thousands of dollars of
free resource, which should certainly factor into the leverage required
to pry the money loose to take proper advantage of you!
> I want to be able to make the cluster easily expandable, in that I will be
> starting with only a few machines (about 2-8), but will be acquiring more as
> time goes on. The university that I am attending surpluses out "old"
> machines every 4 years, and we have set up a program where we can get a
> percentage of the surplus machines for our cluster.
> So, as for size. Initially it will be a smaller cluster, but will grow as
> time goes on.
> Being new to the Beowulf world, I am just mainly looking for some advice as
> to what distro to use (I would never dream of setting up a cluster on
> windows) and if there were any little tricks that weren't mentioned in the
> setup how to guides.
> Oh, and I would also like to know if there was a way to set up a task
> priority where if I had only one application running it would use all the
> processors on the cluster, but if I had two tasks sent to the cluster then
> it would split the load between them and run both simultaneously, but still
> using a maximum for the needed processors.
> Thanks again so much,
> On Feb 8, 2008 9:11 AM, Robert G. Brown <rgb at phy.duke.edu> wrote:
>> On Thu, 7 Feb 2008, Berkley Starks wrote:
>>> Hello all,
>>> I've been a computer user for the past several years working in
>>> areas of the IT world. I've recently been commissioned by my university to
>>> set up the first operating Beowulf Cluster.
>>> I am moderately familiar with the Linux OS, having run it for the past
>>> several years using the distro's of Debian, Ubuntu, Fedora Core, and
>>> With setting up this new cluster I would like any advice possible on
>> what OS
>>> to use, how to set it up, and any other pertinent information that I
>> This question has been answered on-list in detail a few zillion times.
>> I'd suggest consulting (in rough order):
>> a) The list archives (now that you're a member you can get to them,
>> although they are digested and googleable for the most part anyway).
>> b) Google. For example, there is a lovely howto here:
>> that is remarkably current and a good quick place to start.
>> c) Feel free to browse my free online book here:
>> I'm working on making it paper-printable via lulu, but I need time I
>> don't have and so that project languishes a bit. You "can" get a paper
>> copy there if you want, but it is pretty much what is on the free
>> website including the holes.
>>> Oh, and the cluster will be used for computational physics. I am a
>>> major making it for the physics department here. It will need to be
>> able to
>>> use C++ and Fortran at a bare minimum.
>> C, C++ and Fortran are all no problem. The more important questions are:
>> a) How coupled are the parallel tasks? That is, do you want a cluster
>> that can run N independent jobs on N independent nodes (where the jobs
>> don't communicate with each other at all), or do you want a cluster
>> where the N nodes all do work on a common task as part of one massive
>> parallel program? If the former, you're in luck and cluster design is
>> easy and the cluster purchase will be cheap.
>> b) If they are coupled, are the tasks "tightly coupled" so each
>> subtask can only advance a little bit before communications are required
>> in order to take the next step? "Synchronous" so all steps have to be
>> completed on all nodes before any can advance? Are the messages really
>> big (bandwidth limited) or tiny and frequent (latency limited)?
>> If any of these latter answers are "yes", post a detailed description of
>> the tasks (as best you can) to get some advice on choosing a network, as
>> that's the design parameter that is largely controlled by the answers.
>>> Thanks again
>> Robert G. Brown Phone(cell): 1-919-280-8443
>> Duke University Physics Dept, Box 90305
>> Durham, N.C. 27708-0305
>> Web: http://www.phy.duke.edu/~rgb
>> Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
>> Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977