[Beowulf] Definition of HPC
Ellis H. Wilson III
ellis at cse.psu.edu
Mon Apr 15 20:42:32 PDT 2013
On 04/15/2013 02:21 PM, Prentice Bisbal wrote:
> "High performance computing (HPC) is a form of computer usage where
> utlilization one of the computer subsystems (processor, ram, disk,
> network, etc), is at or near 100% capacity for extended periods of time."
It would help me if we could clarify what the end goal of defining
"HPC" actually is. Scientists classify things when it is helpful to
do so, but since the mantra of our field is "it depends," and we need
to hear about all of your goals/details/code/etc. anyhow, trying to
put you back into a general category afterwards seems moot. It is not
as if, after hearing those goals, we'd say, "Oh, you're clearly trying
to solve an HPC problem, so take this push-button HPC solution!" It's
not that simple, I fear, so I wonder about the utility of drawing
imaginary lines in the sand, unless this is the Beo-marketeering
list. In which case, please let me know so I can unsubscribe now :D.
Getting back to the article, I am particularly troubled by a number
of seemingly obvious issues (at least to me, but I could be very
wrong) in its comparison of cloud costs against purchasing one's own
machines:
"Over three years [to purchase and run your own servers], the total is
US$ 2,096,000. On the other hand, using cloud computing via Cycle
Computing...over the three years, the price is about US$ 974,025. Cloud
computing works out to half the cost of a dedicated system for these
workloads."
Issue #1:
This is my biggest issue. Where in the world is there just ONE,
isolated researcher with a million-dollar budget over three years?
Find another researcher to split a cluster with and you match the
cloud cost. Find four and you do it for half the price of Cycle
Computing. Or just buy one-fifth the compute and wait 600 seconds
instead of 120, or an hour and 15 minutes instead of 15 minutes, to
use both of the examples provided.
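A back-of-the-envelope sketch in Python, just to make that splitting
argument concrete (the two totals are the article's; the sharing
factors and the 5x slowdown are my own illustration):

    # Figures quoted in the article, over three years:
    owned_total = 2_096_000   # buy and run your own servers
    cloud_total = 974_025     # rented via Cycle Computing

    # If N researchers split one owned cluster, each pays:
    for n in (1, 2, 4):
        print(n, "researcher(s) ->", owned_total / n, "each")
    # Two roughly match the cloud price; four pay about half of it.

    # Or buy one-fifth the compute and accept a 5x longer wall clock
    # on an EP workload (the article's own timing examples):
    print(120 * 5, "seconds instead of 120")   # 600 s
    print(15 * 5, "minutes instead of 15")     # 1 h 15 min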
Along these lines, I am not buying the argument that some researcher
out there has a completely EP problem (basically a set of scripts)
and is blaming the scheduler for having to wait to run concurrently
on 50k or 100k cores. That is his or her own fault. Just break the
problem into many jobs (50k or 100k separate jobs would be fine),
and so long as the machine isn't busy with somebody else's jobs,
your scheduler isn't broken, and you haven't burnt up your credits
under your institution's scheduling policies, you will finish far
sooner than if you wait for the entire giant cluster to be empty so
that one pointlessly huge job can run a bunch of totally discrete
tasks. Maybe someone can clarify what I'm missing about WHY these
tasks need to run at exactly the same time? What's wrong with your
job running NOT concurrently over 2 or 3 hours? You're going to wait
that long to get a set of instances that large anyhow. And if a
bunch of million-dollar-toting researchers constantly find
themselves maxing out the cluster and waiting too long, just spend
that cash on owned compute to expand it. That's a good problem to
have (having money and spending it on hardware you will still own
after the day is over).
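To be concrete about "break the problem into many jobs," here is a
rough sketch assuming a SLURM-style sbatch that accepts a script on
stdin; the task count and ./my_ep_task are placeholders, and many
sites would prefer a job array or cap how many jobs you can queue:

    #!/usr/bin/env python3
    # Submit one independent single-core job per EP task, so each one
    # starts as soon as any core frees up, instead of waiting for 50k
    # cores to be idle at the same moment.
    import subprocess

    N_TASKS = 50_000
    for i in range(N_TASKS):
        script = (
            "#!/bin/bash\n"
            "#SBATCH --ntasks=1\n"
            f"./my_ep_task --index {i}\n"
        )
        subprocess.run(["sbatch"], input=script, text=True, check=True)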
Issue #2:
I think this is too black-and-white an evaluation of owned compute
versus rented compute. What you probably really want is to own some
static amount of compute that will be saturated most of the time,
and then rent compute for big bursts (e.g. before conferences or
some other research push). Ideally, there would be some kind of
scheduling mechanism (maybe there is already; please share if you
know of one) that transparently expands your private cloud with
rented public cloud capacity for those bursts, so that research goes
on with the exact same commands and the same job-scheduling
expectations. Maybe a tad slower on the public cloud machines, but
nevertheless it will go on.
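Something like this toy placement rule is all I mean; every name and
threshold below is hypothetical rather than any existing scheduler's
API:

    # Run on owned hardware unless the local backlog gets too deep,
    # then spill the job onto rented public-cloud capacity.
    BURST_WAIT_HOURS = 12.0   # backlog we tolerate before renting

    def place_job(estimated_queue_wait_hours):
        """Decide where a submitted job should land."""
        if estimated_queue_wait_hours <= BURST_WAIT_HOURS:
            return "local"        # same commands, same expectations
        return "public-cloud"     # transparently rented burst capacity

    print(place_job(2.0))    # -> local
    print(place_job(40.0))   # -> public-cloud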
Best,
ellis