[Beowulf] Roadrunner shutdown

Brian Dobbins brian.dobbins at yale.edu
Fri Apr 5 14:21:35 PDT 2013

On 4/5/2013 9:43 AM, Lux, Jim (337C) wrote:
> I would think that the problem is more that you can easily stamp out
> another 1000 processors than another 10 software developers.  HPC
> developers are the scarce commodity, and just throwing money at it doesn't
> solve the problem.

   I'm coming from an academic perspective, but to me it seems that HPC 
developers are the scarce commodity not because they're hard to find 
(though there's some truth in that, especially for ones with broad 
experience), but because they're hard to categorize and the benefits of 
a developer are hard to quantify.   A lot of people outside this 
relatively small Beowulf community have an overly simplistic 'model' of 
how computing works -- hardware works or it doesn't and if it does, a 
scientist writes a code and gets results. It's like pressing a button, 
basically.  Nobody thinks about pressing a button /well/, or pressing it 
/efficiently/, or even ensuring that whatever happens when it's pressed 
happens /fast/. It's just assumed that it happens as fast as it can, 
provided the button is 'working'.

   In this view, buying X more nodes makes a lot more sense than hiring 
a developer.  There are never enough cycles to begin with, so all the 
developer does is help make sure the system or code 'works'.  And if the 
system isn't working, well, that's what your systems people are for.  If 
the code isn't working?  Well, some lucky grad student will be up all 
night and day trying to fix it. Sometimes for weeks.  Months.  Even 
years.  And, all this time, the 'scientific throughput' on the whole is 
probably up on the system, due to the additional nodes, even without 
those few applications that aren't working.

   However, once you look at the /details/, oftentimes the scenario 
changes - why, yes, those extra nodes are being used all the time, 
running an atmospheric model 24/7, on a thousand processors.  Except, 
whoops, it's using serialized I/O to your underlying parallel file 
system - so while it's 'working', you could be running 4-5x faster with 
some changes to a library and a few changed settings.   Is this 
something the scientist should know?  We can debate that; I'm on the 
side of 'no, they should focus on their science', but there are points 
to be made on each side.  The fact is, though, that they often /don't/ 
notice it, whereas someone whose focus is not on the scientific output 
but on the computational methods and techniques would do so quickly.   
And, like that, the equation changes - in the parallel universe where 
you /hired/ this person, you've now saved Y node-hours of computation, 
with a fairly small time-slice of your expert.  How do these savings 
compare to the new nodes you'd buy?  That depends on the scale of the 
operation, but we can run some numbers in a moment.

> That is why the real challenge for HPC is in developing smart compilers
> and tools that make it easier to do HPC, even at the cost of needing more
> computational resources.

   I'm absolutely in favor of smart(er) compilers and tools, but like 
any tool, I think an expert will wield it with much greater skill, 
experience and insight than a novice.  Give a chef a knife and some 
vegetables and watch as they're turned into perfectly cut pieces. Give 
me the same knife and set of vegetables and half would end up on the 
floor, I'd probably need a few band-aids, and we'd likely be ordering 
take-out.   Even 'simple' tools like compiler-enabled profiling 
typically present a lot of information to a user that they're uncertain 
how to sift through at first, whereas an expert will know exactly what 
to look for, how to use it, and get results quickly.

   (It's occurred to me we might be using 'HPC developer' in very 
different ways -- as a dedicated person /developing/ the majority of a 
model, some of this might not apply.  As a 'specialist' who works with 
other scientists to allow them to focus on their science while the 
specialist focuses on the code, calculations and systems, I think it 
applies pretty well.)

> Just back of the enveloping here..
> Let's say a "node" costs about $3k plus about $1500 in operating costs
> over 3 years. Make it a round $5000 all told. A developer costs about
> $250-300k, fully burdened for a year. So I can make the choice.. Buy
> another 150 nodes or pay for the developer to make the processing more
> efficient.  Of course, if I buy the nodes, I get the faster computation
> today. If I buy the developer, I have to wait some time for my faster
> solution.

   I haven't looked at hardware prices in a while, but I'd argue 
slightly higher hardware costs -IB, more storage, etc.- not just on the 
node, but ports for the switches, extra disks or Lustre servers, let's 
say $6K all around, including operating costs.  I admittedly don't know 
much about employee costs, but let's imagine a mid-level HPC specialist 
at a university having a salary of $75K.   The fringe rate at a few 
places I just checked hovers around 33-36% or so for this type of 
employee, so ignoring things like office space costs, power, 
incidentals, etc., and just going with the cost to the university as 
'salary * (1.0 + fringe rate)', we come up with just over $100K.  Let's 
bump it up to $120K just because.
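
   A rough sketch of that loaded-cost arithmetic, in Python.  The 
function and variable names are only illustrative; the salary and 
fringe range are the figures quoted above.  The 33% end lands a touch 
under $100K and the 36% end a bit over, which is why rounding up to 
$120K leaves room for everything being ignored here.

```python
# Loaded cost to the university: salary * (1.0 + fringe rate).
# Ignores office space, power, incidentals, etc., as in the text above.
def loaded_cost(salary, fringe_rate):
    return salary * (1.0 + fringe_rate)

SALARY = 75_000                   # assumed mid-level HPC specialist salary
low = loaded_cost(SALARY, 0.33)   # low end of the quoted fringe range
high = loaded_cost(SALARY, 0.36)  # high end of the quoted fringe range

print(f"${low:,.0f} to ${high:,.0f}")   # prints "$99,750 to $102,000"
```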

   For simplicity let's keep the salary static over four years (chosen as 
the life of the nodes), and now we're comparing $480K for an employee or 
80 nodes, and what delivers better usage.   The extreme case of having 
lots of nodes is, in my opinion, pretty easy.  If I have 2000 nodes now, 
adding 80 gives me a 4% improvement in my capabilities, and I'll quite 
happily challenge anyone who thinks I can't improve either their codes, 
workflow or system by more than 4%.  What if I have only 200 nodes, 
though?  Then the extra 80 gives me a 40% increase in job throughput.  
That /sounds/ pretty good, and in reality it might be if your codes are 
well-behaved, production-quality codes that are tuned, your scientists 
know how to modify them for new experiments, and there isn't much need 
of expertise, but 40% more cycles would help.  I've yet to meet a 
scientific department that meets that description, though.    From 
people running N^2 algorithms when N-log-N methods exist, to people 
using NFS as local scratch for large temporary files, to people managing 
literally thousands of files /by hand/ for a parameter-sweep MC code 
because scripting isn't something they're familiar with, a decent level 
of expertise can /often/ render a 2-3x improvement in usage, and 
/sometimes/ much more, but the really high cases (1000x+) are pretty 
rare.  Still, we're talking a 2-3x factor being common, so even if the 
'average' gain when normalized across all your resources is a mere 1.5x, 
that still beats 1.4.  And we're just talking about savings in CPU time, 
not savings in scientist time.
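
   The comparison above can be sketched in a few lines of Python.  
This is only back-of-the-envelope: the $6K node cost, $120K/year 
specialist cost and four-year window are the assumptions from the 
text, the 1.5x 'average' specialist gain is the conservative guess 
made above rather than a measured number, and the names are 
illustrative.

```python
NODE_COST = 6_000              # per node, including operating costs
SPECIALIST_PER_YEAR = 120_000  # padded loaded cost of one specialist
YEARS = 4                      # assumed life of the nodes

budget = SPECIALIST_PER_YEAR * YEARS   # 480,000
extra_nodes = budget // NODE_COST      # 80 nodes for the same money

ASSUMED_SPECIALIST_GAIN = 1.5  # guessed 'average' gain across all resources

def hardware_gain(existing_nodes):
    """Throughput factor from spending the budget on nodes instead."""
    return (existing_nodes + extra_nodes) / existing_nodes

for n in (2000, 200, 20):
    hw = hardware_gain(n)
    better = "specialist" if ASSUMED_SPECIALIST_GAIN > hw else "nodes"
    print(f"{n:5d} nodes: hardware {hw:.2f}x vs specialist "
          f"{ASSUMED_SPECIALIST_GAIN:.2f}x -> {better}")
```

   With these inputs the hardware factor is 1.04x at 2000 nodes, 1.4x 
at 200 and 5x at 20, so the crossover depends entirely on how big the 
existing system is and on how much you trust the 1.5x guess.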

   Of course, then you have the extreme case of very /few/ nodes - if 
you've only got twenty, and are looking at either 80 more or a person, 
well, it's going to depend heavily on what's being done, and I'd 
typically lean towards the nodes if I had only that binary choice.  If I 
could be creative, though, I'd aim to coordinate with multiple 
departments, buy maybe 40-60 nodes, and hire a person.  Besides, that 
expertise might help you take advantage of off-site resources like XSEDE 
even if you have zero nodes locally.

> If I buy the developer, I have to wait some time for my faster solution.

   Yes, if we're talking about developing entirely new methods.  But 
there's a /ton/ of low-hanging fruit that exists in the mix of systems 
tuning, compiler options, /basic/ code changes (not anything deep or 
time-consuming), etc., that takes hours or, at most, a few days, and can 
have massive impacts.  The serial I/O -> parallel I/O example way (way, 
way) above being one example, and as another, I can't tell you the 
number of times I've seen people running production runs with '-O0 -g' 
as their compilation flags.  Or, not using tuned BLAS / LAPACK 
libraries.  Or running an OpenMP-enabled code with 1 process per node 
but having OMP_NUM_THREADS set to 1 instead of 8.  Or countless other 
examples.
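
   A couple of these checks are simple enough to script.  The sketch 
below is a hypothetical pre-flight check, not an existing tool: the 
function names are made up, it assumes one process per node (the case 
described above), and it only looks at the two symptoms mentioned here.

```python
import os

def check_omp_threads(env, cores_per_node):
    """Return a warning if OMP_NUM_THREADS leaves cores idle, else None.

    Assumes one process per node, so threads should match the core count.
    """
    raw = env.get("OMP_NUM_THREADS")
    if raw is None:
        return None  # unset: the OpenMP runtime chooses its own default
    try:
        # The variable may be a comma-separated list for nested
        # parallelism; only the outermost level is checked here.
        threads = int(raw.split(",")[0])
    except ValueError:
        return f"OMP_NUM_THREADS={raw!r} is not an integer"
    if threads < cores_per_node:
        return (f"OMP_NUM_THREADS={threads} but {cores_per_node} cores "
                f"are available per node")
    return None

def check_compile_flags(flags):
    """Return a warning if the flags string disables optimization."""
    if "-O0" in flags.split():
        return "production build compiled with -O0"
    return None

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    for warning in (check_omp_threads(os.environ, cores),
                    check_compile_flags("-O0 -g")):
        if warning:
            print("WARNING:", warning)
```

   Nothing here replaces someone actually looking at the job, but it 
shows how cheap the first pass can be.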

   So that's my lengthy two cents in defense of why it's /very often/ 
favorable to hire HPC specialists over more hardware - the gains are 
certainly much harder to quantify, and clearly more variable depending 
upon the various projects and applications in use, but in my experience 
-and in our environment- the gains in terms of making scientists' jobs 
easier and less time consuming, as well as the saving in CPU hours, more 
than make it worth it. Plus, in those rare moments we're not working, 
the HPC developers can contribute to discussions on the Beowulf list.  
Surely that's something we can all agree we need more of!  :)

   - Brian

(PS.  One thing I cleverly avoided touching upon way back at the top is 
that while a skilled computational person can indeed ensure that 
calculations are more efficient, fast and working well, they can /also/ 
make sure they're working /correctly/.  That's a can of worms that's a 
different discussion!  Getting results /= getting /correct/ results. )

(PPS.  Sorry for the length!)