[Beowulf] Definition of HPC
Ellis H. Wilson III
ellis at cse.psu.edu
Mon Apr 15 20:42:32 PDT 2013
On 04/15/2013 02:21 PM, Prentice Bisbal wrote:
> "High performance computing (HPC) is a form of computer usage where
> utlilization one of the computer subsystems (processor, ram, disk,
> network, etc), is at or near 100% capacity for extended periods of time."
It would help me if we could clarify what the end goal of defining
"HPC" actually is. Scientists classify things when it is helpful to
do so, but since the mantra of our field is "it depends," and we need
to hear about all of your goals/details/code/etc. anyhow, trying to
put you back into a general category afterwards seems moot. It is not
as if, after hearing those goals, we'd say, "Oh, you're clearly trying
to solve an HPC problem, so take this push-button HPC solution!" It's
not that simple, I fear, so I wonder about the utility of drawing
imaginary lines in the sand, unless this is the Beo-marketeering
list. In which case, please let me know so I can unsubscribe now :D.
Getting back to the article, I am particularly troubled by a number
of seemingly obvious issues (at least to me, but I could be very
wrong) in its comparison of cloud costs against purchasing one's own
machines:
"Over three years [to purchase and run your own servers], the total is
US$ 2,096,000. On the other hand, using cloud computing via Cycle
Computing...over the three years, the price is about US$ 974,025. Cloud
computing works out to half the cost of a dedicated system for these
workloads."
Issue #1:
This is my biggest issue. Where in the world is there just ONE,
isolated researcher with a million-dollar budget over three years?
Find another researcher to split a cluster with and you match the
cloud cost. Find four and you do it for half the price of Cycle
Computing. Or just buy one-fifth the compute and wait 600 seconds
instead of 120, or an hour and 15 minutes instead of 15 minutes, to
use both of the examples provided.
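A back-of-the-envelope sketch in Python, just to make that splitting
argument concrete (the two totals are the article's; the sharing
factors and the 5x slowdown are my own illustration):

    # Figures quoted in the article, over three years:
    owned_total = 2_096_000   # buy and run your own servers
    cloud_total = 974_025     # rented via Cycle Computing

    # If N researchers split one owned cluster, each pays:
    for n in (1, 2, 4):
        print(n, "researcher(s) ->", owned_total / n, "each")
    # Two roughly match the cloud price; four pay about half of it.

    # Or buy one-fifth the compute and accept a 5x longer wall clock
    # on an EP workload (the article's own timing examples):
    print(120 * 5, "seconds instead of 120")   # 600 s
    print(15 * 5, "minutes instead of 15")     # 1 h 15 min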
Along these lines, I am not buying the argument that some researcher
out there has a completely EP problem (basically a set of scripts)
and is blaming the scheduler for having to wait to run concurrently
on 50k or 100k cores. That is his or her own fault. Just break the
problem into many jobs (50k or 100k separate jobs would be fine),
and so long as the machine isn't busy with somebody else's jobs,
your scheduler isn't broken, and you haven't burnt up your credits
under your institution's scheduling policies, you will finish far
sooner than if you wait for the entire giant cluster to be empty so
that one pointlessly huge job can run a bunch of totally discrete
tasks. Maybe someone can clarify what I'm missing about WHY these
tasks need to run at exactly the same time? What's wrong with your
job running NOT concurrently over 2 or 3 hours? You're going to wait
that long to get a set of instances that large anyhow. And if a
bunch of million-dollar-toting researchers constantly find
themselves maxing out the cluster and waiting too long, just spend
that cash on owned compute to expand it. That's a good problem to
have (having money and spending it on hardware you will still own
after the day is over).
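To be concrete about "break the problem into many jobs," here is a
rough sketch assuming a SLURM-style sbatch that accepts a script on
stdin; the task count and ./my_ep_task are placeholders, and many
sites would prefer a job array or cap how many jobs you can queue:

    #!/usr/bin/env python3
    # Submit one independent single-core job per EP task, so each one
    # starts as soon as any core frees up, instead of waiting for 50k
    # cores to be idle at the same moment.
    import subprocess

    N_TASKS = 50_000
    for i in range(N_TASKS):
        script = (
            "#!/bin/bash\n"
            "#SBATCH --ntasks=1\n"
            f"./my_ep_task --index {i}\n"
        )
        subprocess.run(["sbatch"], input=script, text=True, check=True)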
Issue #2:
I think this is too black-and-white an evaluation of owned compute
versus rented compute. What you probably really want is to own some
static amount of compute that will be saturated most of the time,
and then rent compute for big bursts (e.g. before conferences or
some other research push). Ideally, there would be some kind of
scheduling mechanism (maybe there is already; please share if you
know of one) that transparently expands your private cloud with
rented public cloud capacity for those bursts, so that research goes
on with the exact same commands and the same job-scheduling
expectations. Maybe a tad slower on the public cloud machines, but
nevertheless it will go on.
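Something like this toy placement rule is all I mean; every name and
threshold below is hypothetical rather than any existing scheduler's
API:

    # Run on owned hardware unless the local backlog gets too deep,
    # then spill the job onto rented public-cloud capacity.
    BURST_WAIT_HOURS = 12.0   # backlog we tolerate before renting

    def place_job(estimated_queue_wait_hours):
        """Decide where a submitted job should land."""
        if estimated_queue_wait_hours <= BURST_WAIT_HOURS:
            return "local"        # same commands, same expectations
        return "public-cloud"     # transparently rented burst capacity

    print(place_job(2.0))    # -> local
    print(place_job(40.0))   # -> public-cloud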
Best,
ellis