[Beowulf] The True Cost of HPC Cluster Ownership

Joe Landman landman at scalableinformatics.com
Tue Aug 11 11:01:37 PDT 2009


Daniel Pfenniger wrote:

>>     There is a cost to *EVERYTHING*
> 
> Well, not really surprising.  The point is to be quantitative,
> not subjective (fear, etc.).  Each solution has a cost and alert
> people will choose the best one for them, not for the vendor.

Sadly, not always (choosing the best one for them).  *Many* times the 
solution is dictated to them by some group with an agreement with some 
vendor.  Decisions about which solution is best often take a back seat 
to which brand to select.  I've had too many conversations that went "we 
agree your solution is better, but we can't buy it because you aren't 
brand X".  Which is not a good reason for selecting or omitting a vendor.

> If many people choose IKEA furniture over traditional vendors
> it is because the cost differential is favourable for them,
> even taking all the overheads into account.

Agreed.  But furniture is not a computer (though I guess it could be ...)


> When commodity clusters came in the 90's the gain was easily a
> factor 10 at purchase.  In my case the maintenance and licenses
> costs of turn-key locked-in hardware added 20-25% of purchase
> cost every year.  With such a high cost we could have hired an
> engineer full-time instead, but it was not possible because of
> the locked-in nature of such machines.  The self-made solution
> was clearly the best.
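
The trade-off in the quote above lends itself to a back-of-the-envelope 
comparison.  A quick sketch (all figures below are illustrative 
assumptions, not numbers from this thread):

```python
# Back-of-the-envelope TCO sketch for the trade-off above.
# All figures are illustrative assumptions, not real quotes.

def turnkey_cost(purchase, maint_rate, years):
    """Purchase price plus annual maintenance/license fees
    (the quote cites 20-25% of purchase cost per year)."""
    return purchase + purchase * maint_rate * years

def selfbuilt_cost(purchase, engineer_salary, years):
    """Commodity hardware plus a full-time engineer's salary."""
    return purchase + engineer_salary * years

years = 5
turnkey = turnkey_cost(purchase=1_000_000, maint_rate=0.22, years=years)
selfbuilt = selfbuilt_cost(purchase=100_000,  # ~factor 10 cheaper at purchase
                           engineer_salary=90_000, years=years)

print(f"turn-key over {years} yr:   ${turnkey:,.0f}")    # $2,100,000
print(f"self-built over {years} yr: ${selfbuilt:,.0f}")  # $550,000
```

Under those assumed numbers the recurring fees alone exceed the cost of 
the hired engineer, which is the point the quote is making.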

For some users, this is the right route.  For Guy Coates and his team, 
for you, and a number of others.  I agree it can be good.

But there are far too many people who think a cluster is a pile-o-PCs + 
a cheap switch + a cluster distro.  It's the "how do I make it work when 
it fails" aspect we tend to see people worrying about online.

I am arguing for commodity systems.  But some gear is just plain junk. 
Not all switches are created equal.  Some inexpensive switches do a far 
better job than some of the expensive ones.  Some brand-name machines 
are wholly inappropriate as compute nodes, yet they are used anyway.

A big part of this process is making reasonable selections: 
understanding the issues with each of these components, and 
understanding their interplay.

I am not arguing for vendor lock-in (believe it or not).  I simply argue 
for sane choices.

> Today one finds intermediate solutions where the hardware is
> composed of compatible elements, and the software is open source.
> Some vendors offer almost ready to run and tested hardware for
> a reasonable margin, adding less than a factor 2 to the original
> hardware cost, without horrendous maintenance fee and restrictive
> license.  The locked-in effect is low, yet not completely zero.
> This is probably the best solution for many budget-conscious
> users.
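
The intermediate option described in the quote above can also be put in 
rough numbers.  A sketch under assumed figures (the "less than a factor 
2" margin is from the quote; the support rate and hardware cost are 
placeholders for illustration):

```python
# Illustrative sketch of the integrator route: pre-built, tested
# commodity hardware at under 2x raw hardware cost, with a modest
# support cost instead of heavy turn-key maintenance fees.
# All figures are assumptions, not real vendor pricing.

hardware = 100_000
integrator_margin = 1.8   # "less than a factor 2" over hardware cost
support_rate = 0.05       # assumed modest yearly support fraction
years = 5

purchase = hardware * integrator_margin
integrated = purchase + purchase * support_rate * years
print(f"integrator route over {years} yr: ${integrated:,.0f}")  # $225,000
```

On those assumptions the integrator route lands between the bare 
self-build and the turn-key extremes, which is why it suits many 
budget-conscious users.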

Yes.  This is what we stress.  Unfortunately, we have run into 
purchasing groups that like to try to save a buck and will buy 
almost-but-not-quite-the-same-thing for the clusters we have put 
together, which makes it very hard to pre-build and pre-test.  Worse, 
when we see what they have purchased, and that it really didn't come 
close to the spec we used, well ... I fail to see how being required to 
purchase the right thing after purchasing the wrong thing, which you 
can't return, saves you money.

We have had this happen too many times.

>> Heinlein called it TANSTAAFL.  Every single decision you make carries 
>> with it a set of costs.
>>
>> What purchasing agents, looking at the absolute rock bottom prices do 
>> not seem to grasp, is that those costs can *easily* swamp any 
>> purported gains from a lower price, and raise the actual landed price, 
>> due to expending valuable resource time (Gerry et al) for months on 
>> end working to solve problems that *should* have been solved previously.
>>
>> There is a cost to going cheap.  This cost is time, and loss of 
>> productivity.  If your time (your students time) is free, and you 
>> don't need to pay for consequences (loss of grants, loss of revenue, 
>> loss of productivity, ...) in delayed delivery of results from 
>> computing or storage systems, then, by all means, roll these things 
>> yourself, and deal with the myriad of debugging issues in making the 
>> complex beasts actually work.  You have hardware stack issues, 
>> software stack issues, interaction issues, ...
> 
> You forget to mention that turn-key locked-in systems in my experience 
> entail
> inefficiency costs because the user cannot decide what to do when

I can't speak to your experience, as I don't have a clue what you have 
experienced.

Vendor lock-in is, IMO, not a great thing.  It increases costs, makes 
systems more expensive to support, and reduces choices later on.  Yet 
we run head first into vendor lock-in in many purchasing departments. 
They prefer buying from one vendor with whom they have struck 
agreements.  Agreements which don't work to their benefit, but do for 
the vendors.

> completely ignoring what is going on.  Many problems may be solved in
> minutes when the user controls the cluster, but may need days
> or weeks for fixes from the vendor.  A balanced presentation should
> weight all the aspects of running a cluster.

Yes.  Doug's presentation did show you one aspect, and if you want more 
to "balance" the joy of clustered systems, his work can certainly be 
expanded upon.

> 
>>
>> What I am saying is that Doug is onto something here.  It ain't easy. 
>> Doug simply expressed that it isn't.
> 
>> As for the article being self serving?  I dunno, I don't think so.  
>> Doug runs a consultancy called Basement Supercomputing that provides 
>> services for such folks.  I didn't see overt advertisements, or even, 
>> really, covert "hire us" messages.  I think this was fine as a white 
>> paper, and Doug did note that it started life as one.
> 
> You may have noticed that this article was originally written on demand
> of SiCortex...

That wasn't lost on me :(.  Actually, one of the things we are actively 
talking about relative to our high performance storage is "freedom from 
bricking".  If the proverbial bus hits the company a day after you get 
your boxes from us, our units are still supportable, and you can pay 
another organization to support them.  We aren't aware of other vendors 
doing what we do who could (honestly) make such a claim.  Not even the 
ones that use the (marketing) label of "open storage solutions". 
Yeah.  Open.



-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
