[Beowulf] newbie's dilemma
Robert G. Brown
rgb at phy.duke.edu
Thu Mar 2 07:51:46 PST 2006
On Thu, 2 Mar 2006, Don R. Baker wrote:
> Hello again,
> O.K. So Option 3 -- 32 desktops from HP or Dell-- is eliminated because
> I cannot afford to upgrade the air conditioning unit in the room
> available and I cannot afford an onsite service contract to cover repair
Sure you can -- you just spend nodes to do so. If you get 10% fewer
nodes (approximately) you can get service contracts on them AND reduce
your AC load.
As I said, to get the most help from the list we have to know the
problem constraints as you lay them out below. I suspect that the
"problem" with your designs is primarily that you haven't properly
balanced your resources between the hardware up front and the long term
risks and costs of maintenance. Having done at least two clusters that
turned out to be disasters beyond all imagining from the
turned-out-to-be-a-piece-of-crap motherboard or the
blown-Taiwanese-capacitor point of view, I will NOT buy any sort of
production cluster without a 3-4 year obligation on the part of the
vendor to fix it if it breaks, even if it ALL breaks, until I've gotten
my full expected money's worth out of it.
That involves simply adding the cost of the service contract into the
cost of the nodes, no matter who you get them from. Some companies
quote nodes to you this way up front -- they won't SELL them without
such a contract on them as their reputation goes to hell if you buy them
and they all break anyway, so they HAVE to fix them ultimately. The
Taiwanese capacitor problem affected some 2/3 of the motherboards in
production at the time -- not even the best of vendors can always avoid
this. If you choose not to do it, well, then you gamble. Maybe your
systems run flawlessly for four years, only blowing a few power supplies
and a disk or two, and you stay well within your $500/year repair budget
and don't lose your mind fixing machines all the time by hand. Maybe
you run them for six months, get them past their 90 day warranty period,
and your AC fails for just long enough to toast every single
motherboard, or not toast them on the spot but heat them up enough that
nodes start failing every month instead of every year. Most of the
onsite service plans don't have exclusions except for obvious and
semideliberate damage on the part of the owner, although obviously you
want to look at this in detail, per contract.
> In response to RGB's request for more information:
> "It might be more helpful if you gave us your
> budget and your software constraints (e.g. how much memory per CPU or
> core do you need). I'm assuming embarrassingly parallel MC (which is
> what I do) so the network is basically irrelevant."
> Here are my budgetary constraints and my needs
> Budget ~US $ 25 000, with the possibility of "liberating" another $ 5
> 000 out of another grant or my university. My Monte Carlo simulations
> deal with percolation problems, Potts models, and fiber bundle models,
> some of which require in excess of 512 MB of memory; I am trying to buy
> machines with at least that amount of memory per core and ideally twice
> that amount. The network is irrelevant for these simulations, but based
> upon my reading I think I should go for gigabit ethernet.
OK. What I'd recommend is soliciting quotes for this. Doing so on this
list is OK -- people do it from time to time and a number of vendors
(well-behaved vendors!) are on the list. Some may respond politely to
you directly just from reading this, not with a quote but with an offer
to generate one. You'll probably/maybe get offers of a quote from
Penguin (Michael Will), from ASL (Jeff Nguyen), and from Scalable
Informatics (Joe Landman) that I can think of, maybe a few others. See
what they can do. Some of them will be high in the sense that you are
willing to do more of the work to save money where they want to do more
of a turnkey cluster. Others will be happy to sell you the nodes and a
rack and leave the rest to you.
Using Penguin's online node configurator you can get dual, dual core
Opteron 265 (Altus 1400) 1U nodes with 2 GB of memory and a small hard
disk (no CD drive), with 3 year standard warranty, for $2227. Drop the
hard disk and use Warewulf diskless and you can save about $227.
Allowing for a rack and mounting hardware, for shipping, and for your
networking and miscellaneous costs (some of which might well be
compensated for in an actual quote, since buying multiple systems for a
university you'll likely get a small discount on the web price) you can
get somewhere between 8 (for sure) and 10 nodes. That gives you 512 MB
per core, the ABILITY to run jobs up to 2 GB in size by using fewer than
the four cores available per box, and the option to trade off now: drop
the disk and get 4 GB/node (1 GB/core) for about the same price.
Note also that I'm guessing that these boxes draw somewhere between 200
and 300 watts fully loaded with tasks (anybody out there measure?
Michael? Y'all have a kill-a-watt back at your plant?). Just 8 of them
leaves you nicely set as far as your room's power and cooling resources
are concerned, where the extra heat generated by 16 or 32 chassis, their
individual power supplies, their individual (unused but powered up)
peripherals would become a problem.
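
As a rough check of that power argument, here is a minimal Python sketch.
It uses the 200-300 W per node guess above together with the ~1000 W per
circuit loading limit and the ~5800 W AC capacity estimated in the earlier
message quoted below. None of these are measurements, and desktop chassis
would not necessarily draw the same as a 1U node, so treat the 16 and 32
chassis rows as illustrative only.

```python
# Rough power/cooling headroom check.
# Every figure here is an estimate taken from the thread, not a measurement.
CIRCUITS = 4
SAFE_WATTS_PER_CIRCUIT = 1000   # suggested loading per 15 A circuit
AC_CAPACITY_WATTS = 5800        # roughly what a 20,000 BTU/h unit removes

NODE_WATTS_LOW = 200            # guessed draw per loaded node (low end)
NODE_WATTS_HIGH = 300           # guessed draw per loaded node (high end)


def load_range(nodes):
    """Return (low, high) estimated total draw in watts for `nodes` chassis."""
    return nodes * NODE_WATTS_LOW, nodes * NODE_WATTS_HIGH


def fits(nodes):
    """True if even the high-end estimate stays under both room limits."""
    _, high = load_range(nodes)
    limit = min(CIRCUITS * SAFE_WATTS_PER_CIRCUIT, AC_CAPACITY_WATTS)
    return high <= limit


for n in (8, 16, 32):
    low, high = load_range(n)
    print(f"{n:2d} chassis: {low}-{high} W, fits: {fits(n)}")
```

With these guesses only the 8 chassis case stays under the 4000 W limit
at the high end of the range.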
That's a minimum of 32 to a maximum of 40 Opteron processor cores with
anywhere from 512MB to 1 GB of memory per core, well within your budget,
with full protection from MAJOR maintenance headaches for the life of
your project (yes, you still have to call them if things break but
honestly, almost none of the Penguin nodes we've bought so far (Opteron
242 Altus 1300's) have broken over the 1.5 years we've had them). We
have had a LOT more service issues (per unit time) with Dell hardware,
although Dell's onsite service is excellent after the fact.
I'm not suggesting you CHOOSE Penguin, though -- get quotes from a lot
of these vendors, then ask opinions about specific quotes or
motherboards on the list if you have any doubts or questions. Penguin
has to earn your business by giving you a good price AND the warm
fuzzies reliability-wise. That's the beauty of the free (COTS) market,
right? Go ahead and look into minitowers and steel shelving clusters as
well -- but DO include hot and cold running maintenance on them, even if
you price cheap old e-machines AMD-64's for $350 a box from Best Buy.
Last, consider a split strategy. Say you get the price per node down to
$2000, planning to run diskless and 2G, but have to configure ONE node
with disk etc and 4 GB as a server and place to run occasional "big"
jobs, get your rack, get a simple/cheap GigE switch and some wiring --
let's say that your cost profile is $2K per node, $4K for ONE node (the
"head node") and the rack combined, $1K for networking etc. You can do:

  networking etc    $1K
  head node         $4K
  seven nodes      $14K
  ------------------------
  total            $19K
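
The arithmetic behind this split can be sketched in a few lines of
Python. This is only an illustration: it assumes the full $30K ($25K
plus the $5K that might be "liberated") is actually available, and it
uses the round $2K/$4K/$1K figures above rather than real quotes.

```python
# Split-strategy budget sketch using the round numbers from the text.
# ASSUMPTION: the extra $5K really can be found, giving $30K in total.
BUDGET = 25_000 + 5_000
NODE_PRICE = 2_000  # diskless 2 GB compute node (round figure)

first_purchase = {
    "networking etc": 1_000,
    "head node + rack": 4_000,
    "seven compute nodes": 7 * NODE_PRICE,
}

spent = sum(first_purchase.values())
leftover = BUDGET - spent
extra_nodes_later = leftover // NODE_PRICE  # if it ALL went to more nodes

print(f"spent now:  ${spent}")
print(f"held back:  ${leftover}")
print(f"extra nodes affordable later at today's price: {extra_nodes_later}")
```

The held-back amount is what the deferred options are drawn from.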
Get it. Install it. Live with it. Use it. For six months, for a
year. You've still got $11K leftover, right? At the end of this time,
you can decide to:
add five more nodes (if your power/cooling will stand it) $10K.
(Probably five more with 1 GB/core at that time).
add 2G per node to the ones you have, get maybe four more nodes with 4
add a high end network (oops, your problem turned out to need one)
add 3 nodes, lots of memory, and a serious visualization workstation
add 3-4 FASTER nodes -- CPU prices have dropped, memory prices have
For MC simulation, you can actually show that over a 3 year period it
makes as much sense to spend your money in 1/3's and ride Moore's Law
instead of spending it all in year one anyway, depending on just where
you are relative to the discrete jumps.
3 years x 1/3 x 1 (year 0 performance multiplier) = 1 cluster-year of
work from the first 1/3
2 years x 1/3 x 1.73 (year 1 performance multiplier) = ~1.15 cluster-years
of work from the second 1/3
1 year x 1/3 x 3 (year 2 performance multiplier) = 1 cluster-year of
work from the third 1/3
As you can see, you pretty much break even, on average, depending a bit
on just where the discrete jumps on Moore's Law performance land. I
think that the ability to redirect your money intelligently as your
actual production dictates dynamically more than justifies reserving at
least 1/3 or so for contingencies and design refinements.
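
For anyone who wants to play with the numbers, here is a minimal Python
sketch of that staged-spending estimate. It uses the multipliers quoted
above (1, 1.73, 3); those are rough guesses at price/performance growth,
not measured values, and a "cluster-year" here means one year of work on
a cluster bought with the whole budget at year-0 prices. Note that the
middle term evaluates to about 1.15 rather than exactly 1, which does
not change the conclusion.

```python
# Staged spending vs. spending everything up front.
# Each tranche: (years of use remaining, performance multiplier at purchase).
# The multipliers are the rough guesses from the text, not measurements.
TRANCHES = [(3, 1.0), (2, 1.73), (1, 3.0)]


def staged_work(tranches):
    """Cluster-years of work delivered by each equal share of the budget."""
    share = 1 / len(tranches)
    return [years * share * mult for years, mult in tranches]


def upfront_work(total_years=3, mult=1.0):
    """Cluster-years of work if the whole budget is spent in year 0."""
    return total_years * mult


parts = staged_work(TRANCHES)
for i, work in enumerate(parts, start=1):
    print(f"tranche {i}: {work:.2f} cluster-years")
print(f"staged total:  {sum(parts):.2f} cluster-years")
print(f"upfront total: {upfront_work():.2f} cluster-years")
```

Changing the multipliers shows how sensitive the break-even is to where
the discrete jumps land.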
I'm assuming, BTW, that you don't pay the power bills out of this
budget. If this assumption is NOT right, of course you have to buy a
LOT fewer nodes (about 1/3 fewer) and save money to pay for power...
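
A quick way to see where "about 1/3 fewer" comes from is the Python
sketch below. It is hedged: it assumes the $1/watt/year rule of thumb
from the earlier message quoted below (good to within a factor of two),
a 250 W node (the midpoint of the 200-300 W guess), the round $2K node
price, and a 3 year lifetime, and it ignores the head node, rack, and
network.

```python
# What paying for power out of the hardware budget does to the node count.
# ASSUMPTIONS: $1/watt/year (factor-of-two accuracy), 250 W per node,
# $2000 per node, 3 year lifetime. None of these are measured values.
DOLLARS_PER_WATT_YEAR = 1.0


def annual_power_cost(watts, rate=DOLLARS_PER_WATT_YEAR):
    """Yearly cost in dollars to power and cool a constant load."""
    return watts * rate


def fraction_fewer_nodes(node_price, node_watts, years,
                         rate=DOLLARS_PER_WATT_YEAR):
    """Fraction of nodes given up when lifetime power comes from the budget."""
    lifetime_power = node_watts * rate * years
    return lifetime_power / (node_price + lifetime_power)


print(f"4000 W room: ${annual_power_cost(4000):.0f}/year")
print(f"8 nodes at 250 W: ${annual_power_cost(8 * 250):.0f}/year")
print(f"fewer nodes: {fraction_fewer_nodes(2000, 250, 3):.0%}")
```

With these inputs the reduction comes out a little under 30%, and it
moves toward 1/3 as the assumed wattage or electricity rate goes up.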
> Thank you all for your thoughtful responses. I am finding them very
> Wishing you the best,
> On Wed, 2006-03-01 at 18:09, Robert G. Brown wrote:
>> On Tue, 28 Feb 2006, Don R. Baker wrote:
>>> for 8 years, but consider myself to still be a beginner. I have a room
>>> with 4, 15 amp circuits and a 20 000 btu air conditioning unit installed
>>> that I can use for the next 2 years, but after that I may need to find
>>> another home for the system.
>> Let's see. 20KBTU is a bit more than 1.5 tons of AC, call it the
>> ability to remove 5800 Watts total. 4 x 15 x 120 is 7200 Watts peak,
>> or about 5000 Watts RMS. In my opinion this is going to leave you a bit
>> light on AC if you run the circuits fully loaded, and don't forget warm
>> bodies (60 W) and built in light bulbs etc. on other circuits (maybe
>> several hundred W more). You have to not only remove the heat as fast
>> as it comes in but get ahead some, correct for heat that infiltrates
>> through the walls, and get the room temperature down below 20C (68 F) if
>> at all possible. 15-16C is more like it -- cold enough to just be
>> If you limit what you run per circuit to roughly 1000 Watts, that is
>> 4000 watts and gives you a bit of margin. Or get a bigger AC -- a 2 ton
>> AC is still pretty cheap and would probably manage fully loaded
>> circuits. Just a thought.
>>> My dilemma is that for my budget I can buy one of the following
>>> Solution #1
>>> A custom built "personal cluster" with 8 dual core processors either
>>> Xeons or Opterons (16 cores and 16 GB of memory) with all the software
>>> installed, ready to go.
>>> Solution #2
>>> I can buy 16 workstations, each with Dual Core Athlon X64 4400+
>>> processors (32 cores and 32 GB of memory) upon which I will probably
>>> install either Warewulf or Oscar.
>>> Solution #3
>>> I can buy 32 HP or Dell "mass market" desktops running dual core chips
>>> (64 cores and 64 GB memory) upon which I will probably install either
>>> Warewulf or Oscar. (Note that I read the discussion this past November
>>> on "cheap PCs this christmas")
>>> Obviously, I get more computing power in the last two solutions, but at
>>> what cost in terms of time and upkeep? Once the system is up and
>>> running I can dedicate about 5 hours per week, and probably no more, and
>>> CAD$ ~500 per year for maintenance.
>> I personally would reject #3 out of hand, unless you buy three year
>> onsite service contracts on the Dells (spending nodes as required).
>> Dell doesn't do Opterons, I don't think, as well. HPs ditto.
>> Solutions #1 or #2 are both reasonable, although I'm not sure where your
>> numbers are coming from. It might be more helpful if you gave us your
>> budget and your software constraints (e.g. how much memory per CPU or
>> core do you need). I'm assuming embarrassingly parallel MC (which is
>> what I do) so the network is basically irrelevant.
>>> Do any of you have some sage advice? Have any of you used a "personal
>>> cluster"? Any thoughts you may have will be very much appreciated.
>>> Thank you all for your time.
>> Sure, a bunch of us (myself included) have personal clusters, although
>> yours is going to be bigger than mine -- I never have more than about 10 nodes
>> because at that point my house starts to melt in the summertime (and the
>> nodes start to cost roughly $1000/year just to run). Remember, power
>> costs ballpark of $1/watt/year to heat AND remove the heat (within a
>> factor of two) so if you DO fill your room to capacity with 4000 watts
>> running 24x7, plan to spend around $4000/year just to run it and keep it
>>> Wishing you the best from a cool Montreal,
>> Although there is that -- I suppose in the wintertime you could just
>> open a window and snow-cool it... but that at most knocks it down to
>> $3000, because most of the money is for the power, not the cooling:-).
>> From this point of view getting fewer, faster nodes (e.g. 8 dual,
>> dual core nodes from e.g. Penguin or ASL, 32 processor cores) is likely
>> to be a net savings in power and in money PAYING for power; high quality
>> nodes are also less likely to break, and will take less of your time for
>> both soft and hard maintenance. I'd really try to keep your system count down for
>> home clusters as they can eat enough time and money to destroy personal
>> relationships with loved ones...
>> They don't have to be preinstalled with linux, though. Oh, they may BE
>> preinstalled (often with SuSE) but I'd advise reinstalling CentOS or FC
>> (see archives for pros and cons of choice). That way you get an
>> indefinite free update stream and full yum-ability. SuSE does yum
>> (thanks to Joe Landman of this list, who might ALSO sell you prebuilt
>> nodes) but it ain't necessarily pretty...
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu