[Beowulf] Utility Supercomputing...

Fri Mar 1 20:31:48 PST 2013

> http://www.hpcwire.com/hpcwire/2013-02-28/utility_supercomputing_heats_up.html

well, it's HPC wire - I always assume their name is acknowledgement that 
their content is much like "HPC PR wire", often or mostly vendor-sponsored.
call me ivory-tower, but this sort of thing:

 	Cycle has seen at least two examples of real-world MPI applications
 	that ran as much as 40 percent better on the Amazon EC2 cloud than
 	on an internal kit that used QDR InfiniBand.

really PISSES ME OFF.  it's insulting to the reader.  let's first assume it's 
not a lie - next we should ask "how can that be"?  EC2 has a smallish amount
of virt overhead and weak interconnect, so why would it be faster?  AFAICT,
the only possible explanation is that the "internal kit" was just plain
botched.  or else they're comparing apples/oranges (say, different vintage 
cpu/ram, or the app was sensitive to the particular cache size,
associativity, SSE level, etc.)  in other words, these examples do not inform
the topic of the article, which is about the viability of cloud/utility HPC.

the article then concludes "well, you should try it (us) because it doesn't
cost much".  instead I say: yes, gather data and when it indicates your "kit"
is botched, you should fix your kit.

I have to add: I've almost never seen a non-fluff quote from IDC.  the ones
in this article are doozies.

> that are only great in straight lines ;-)  Another thing to think of is
> total cost per unit of science. Given we can now exploit much larger

people say a lot of weaselly things in the guise of TCO.  I do not really
understand why cloud/utility is not viewed with a lot more suspicion.
AFAICT, people's thinking gets incredibly sloppy in this area, and they 
start accepting articles of faith like "Economies of Scale".  yes, there 
is no question that some things get cheaper at large scale.  even if we 
model that as a monotonic increase in efficiency, it's highly nonlinear.
consider the main components of TCO:

1. capital cost of hardware.
2. operating costs: power, cooling, rent, connectivity, licenses.
3. staff operating costs.

big operations probably get some economy of large-scale HW purchases.  but
it's foolish to think this is giant: why would your HW vendor not want to 
maintain decent margins?

power/cooling/rent are mostly strictly linear once you get past trivial
clusters (say, few tens of racks).  certainly there is some economy possible,
but there isn't much room to work with.  since power is about 10% of
purchase cost per year, mediocre PUE makes that 13%, and because we're
talking cloud, rent is off the table.  I know Google/FB/etc manage PUEs of
near 1.0 and site their facilities to get better power prices.  I suspect
they do not get half-priced power, though. and besides, that's still only
going to take the operating component of TCO down to 5%.  at the rate cpu
speed and power are improving, they probably care more about accelerated
amortization.
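
to make the power arithmetic above concrete, here is a minimal sketch in
python.  the 10% and PUE 1.3 figures are the ones from the paragraph; the
function name and the price_factor parameter are hypothetical, and "PUE 1.0
with half-priced power" is the best case I'm granting the big operators, not
a measured number.

```python
# power cost per year as a fraction of hardware purchase cost.
# base_fraction=0.10 and pue=1.3 are the figures from the text above;
# price_factor is a hypothetical knob for a discounted electricity price.
def annual_power_fraction(base_fraction=0.10, pue=1.3, price_factor=1.0):
    return base_fraction * pue * price_factor

typical = annual_power_fraction()                              # mediocre PUE
hyperscale = annual_power_fraction(pue=1.0, price_factor=0.5)  # assumed best case

print(f"typical:    {typical:.0%} of purchase cost per year")
print(f"hyperscale: {hyperscale:.0%} of purchase cost per year")
print(f"difference: {typical - hyperscale:.0%} of purchase cost per year")
```

so even granting the best case, the gap is about 8 points of purchase cost
per year, which is the whole of the power advantage a huge operator has to
work with.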

staff: box-monkeying is strictly linear with size of cluster, but can be 
extremely low.  (do you even bother to replace broken stuff?).  actual
sysadmin/system programming is *not* a function of the size of the facility
at all, or at least not directly.  diversity of nodes and/or environments
is what costs you system-person time.  you can certainly model this, but it's
not really part of the TCO examination, since you have to pay it either way.
in short, scale matters, but not much.

so in a very crude sense, cloud/utility computing is really just asking
another company to make a profit from you.  if you did it yourself, you'd 
simply not be making money for Amazon - everything else could be the same.
Amazon has no special sauce, just a fairly large amount of DIN-standard
ketchup.  unless outsource-mania is really a reflection of doubts about
competence: if we insource, we're vulnerable to having incompetent staff.

the one place where cloud/utility outsourcing makes the most sense is at 
small scale.  if you don't have enough work to keep tens of racks busy,
then there are some scaling and granularity effects.  you probably can't 
hire 3% of a sysadmin, and some of your nodes will be idle at times...
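
the granularity effect can be sketched the same way.  this assumes, purely
for illustration, that staff can only be hired in whole-person units; the
function name and the hire_granularity parameter are hypothetical.

```python
import math

# staff you actually pay for, given the fraction of a person the work needs.
# hire_granularity=1.0 is an assumption: you can only hire whole people.
def staff_overhead(fte_needed, hire_granularity=1.0):
    hired = math.ceil(fte_needed / hire_granularity) * hire_granularity
    return hired, hired / fte_needed   # (FTE paid for, ratio paid/needed)

small_hired, small_ratio = staff_overhead(0.03)   # the "3% of a sysadmin" case
large_hired, large_ratio = staff_overhead(2.0)    # a site needing two full people

print(f"small site: pay for {small_hired:.0f} FTE, {small_ratio:.1f}x what is needed")
print(f"large site: pay for {large_hired:.0f} FTE, {large_ratio:.1f}x what is needed")
```

the overpayment ratio is large only while the need is a small fraction of one
person, and it disappears once the work fills whole people.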

I'm a little surprised there aren't more cloud cooperatives, where smaller
companies pool their resources to form a non-profit entity to get past these 
dis-economies of very small scale.  fundamentally, I think it's just that 
almost anyone thrust into a management position is phobic about risk.
I certainly see that in the organization where I work (essentially an academic
HPC coop.)

people really like EC2.  that's great.  but they shouldn't be deluded into 
thinking it's efficient: Amazon is making a KILLING on EC2.

> systems than some of us have internally, are we are starting to see
> overhead issues of vanish due to massive scale, certainly at cost?  I know

eh?  numbers please.  I see significant overheads only on quite small systems.

> for a fact that what we call "Pleasantly Parallel" workloads all of this

hah.  people forget that "embarrassingly parallel" is a reference to 
the "embarrassment of riches" idiom.  perhaps a truer name would be
"independent concurrency".

> I personally think the game is starting to change a little bit yet again
> here...

I don't.  HPC clusters have been "PaaS cloud providers" from the beginning.
outsourcing is really a statement about an organization's culture: an assertion
that if it insources, it will do so badly.  it's interesting that some very 
large organizations like GE are in the middle of a reaction from outsourcing...

regards, mark hahn.