[Beowulf] [OT] HPC and University IT - forum/mailing list?

Tue Aug 15 20:17:35 PDT 2006

On Tue, 15 Aug 2006, Mike Davis wrote:

> Mark Hahn wrote:
>> huh?  what value does big-A have to add here?  the correct queueing system 
>> is
>> the one that is cheap, low-maintenance, efficient, easy to use, etc. those 
>> are things that users and sysadmins know, not behind-desk-sitters...
>
>
> Difference of definition here. I believe that Big-A administration is how to 
> best manage the resources of Technology to meet everyones needs.

To hijack this theme (just a bit:-) it is important to note that I don't
agree with Mark that HPC "has" to be done centrally to be done well. If
this were true, cluster computing as we are discussing it would probably
not exist as it would never have had the opportunity to grow in
competition with the EXISTING central HPC paradigm of the time, e.g.
the Cray or other big iron solutions.

I actually think that we all benefit from having many kinds of clusters
and cluster paradigms, ranging from small to enormous, built by
everybody from kids to teams of professionals.  Out of this rich mix
comes innovation, invention, and new ideas, the fittest of which survive
(sometimes in niches, although many are broadly adopted).

This whole range of cluster computing structures can, and often does,
actually exist at a single university, and in few other places in the
known universe.  Which means that there are significant problems that face
big-A university Administrators when they try to sanely support cluster
computing.  Really any kind of computing, as there is always the same
conflict between "easy" and "controllable" centralized computing support
(which historically becomes institutionalized, stagnant, a haven for
empire-builders, and ultimately very inefficient) and decentralized
computing support, which is not easy at all, is intentionally only
partially "controllable" in the sense that Administrators like, but which
is dynamic, free, rapidly evolving, and which GENERALLY, if run by
people who think like surfers and are constantly looking to catch the
next good wave, remains just about as efficient economically as it is
possible to be.

Come to think of it, this is really an even broader polar conflict --
communism and centralized economy vs capitalism and free market economy,
feudal government vs democracy, Windows vs Linux.  To nearly any
significantly complex problem, there is a solution that is easy,
nominally controllable, expensive, and -- wrong.  Wrong in the specific
sense that it is like a greedy gradient (hill-climbing) method in
optimization theory -- it rapidly moves along the easy uphill pathways
to the top of a local molehill and calls it the mountain, failing
utterly to sample the richness of the landscape all around.  The goal is
not necessarily to find the highest peak in all existence, but rather
the best possible method for following a path that leads down through
valleys and up the ever-changing hills to find a pretty darn good peak.
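
To make the molehill-vs-mountain picture concrete, here is a minimal
sketch in Python.  Everything in it is an assumption made up for
illustration: the one-dimensional "terrain" (landscape), the step size,
and the evenly spaced starting points.  It only shows that a single
greedy climber stops on whatever bump is nearest, while surveying from
many starting points finds a much better peak.

```python
def landscape(x):
    """Hypothetical terrain: a molehill of height 1 at x=2 and a
    much better peak of height 3 at x=8, with flat ground between."""
    molehill = 1.0 * max(0.0, 1.0 - abs(x - 2.0))
    peak = 3.0 * max(0.0, 1.0 - abs(x - 8.0) / 2.0)
    return max(molehill, peak)


def hill_climb(f, x, step=0.1, max_iters=1000):
    """Greedy local search: move to the better neighbour until
    neither neighbour is an improvement."""
    for _ in range(max_iters):
        best = max((x - step, x + step), key=f)
        if f(best) <= f(x):
            break  # no uphill move left: we are on *a* top, not *the* top
        x = best
    return x


def multi_start(f, starts, **kwargs):
    """Run the same greedy climber from many starting points and
    keep the highest result."""
    return max((hill_climb(f, s, **kwargs) for s in starts), key=f)


if __name__ == "__main__":
    greedy = hill_climb(landscape, 1.5)            # starts near the molehill
    starts = [0.5 + i for i in range(10)]          # survey the whole terrain
    surveyed = multi_start(landscape, starts)
    print("single climber:", greedy, landscape(greedy))
    print("many climbers: ", surveyed, landscape(surveyed))
```

The single climber is not doing anything stupid locally; every step it
takes is an improvement.  It is wrong only because nobody looked at the
rest of the landscape.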

Nature, naturally, uses rich methods to generate its steady succession
of ever higher evolutionary peaks including us.  Wise institutions
encourage a mix of rich, adaptive HPC and general computing environments
that permit a large measure of choice AND a large measure of
non-coercive guidance and support.

Such wisdom does not come easy, though, not even in the ivory tower.
Especially if you are a decision maker trying to intelligently spend
limited resources.  Large computer clusters require non-trivial
infrastructure support of all sorts -- physical space with specialized
engineering, large amounts of electrical power that meets certain
specifications, equally large amounts of highly reliable fail-safed
cooling, a certain amount of physical security and access control, and
competent engineering, systems management, and programming support.

None of these are free, and many of them aren't cheap.  And THEN there
is the cost of the cluster (nodes and network) itself, which really has
to be viewed as an amortized annual expense just to keep a cluster of
uniform power (relative to Moore's Law progress) going, plus the cost
of consumable resources -- human and electrical and tapes and wires and
so on.

Smaller clusters are in a sense much easier, especially if they are
built, administered, run, and used by the same person -- take a bunch of
shelving, slap together some computers and a simple network, install
Linux and utter a mystical incantation involving 1000 chickens vs an
ass, ending up with "poof, you're a cluster", then just USE the cluster,
paying the moderate electrical bill or getting it paid for by the
university.  You can't get much more cost efficient than that, which is
why and how clusters got their start.  It only gets inefficient when it
gets too large, or if a single user cannot maintain a ~100% duty cycle.
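
The duty cycle point is just arithmetic, and a rough sketch in Python
may make it plainer.  The function name, the straight-line amortization,
and every dollar figure below are hypothetical, chosen only to show the
shape of the calculation and not taken from any real cluster.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_useful_node_hour(hardware_cost, lifetime_years,
                              annual_running_cost, nodes, duty_cycle):
    """Amortized annual cost divided by the node-hours actually used.

    Assumes straight-line amortization of the hardware over its
    useful lifetime; duty_cycle is the fraction of time the nodes
    are doing real work (0 < duty_cycle <= 1).
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    annual_cost = hardware_cost / lifetime_years + annual_running_cost
    useful_hours = nodes * HOURS_PER_YEAR * duty_cycle
    return annual_cost / useful_hours


if __name__ == "__main__":
    # Hypothetical small cluster: 16 nodes, replaced every 3 years.
    for duty in (1.0, 0.5, 0.25):
        cost = cost_per_useful_node_hour(
            hardware_cost=30_000, lifetime_years=3,
            annual_running_cost=5_000, nodes=16, duty_cycle=duty)
        print(f"duty cycle {duty:.2f}: {cost:.3f} per useful node-hour")
```

Halving the duty cycle doubles the cost of every useful node-hour,
which is the whole argument for sharing a cluster once a single user
can no longer keep it busy.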

Then there is everything in between.  Big-A admins have no good way to
figure out how to allocate or partition resources required to support
this wide range of needs.  Centralized clusters are so expensive at the
infrastructure level (however cheap they are at the management level or
in terms of resource sharing to maintain a high duty cycle) that after
paying for them there is often little money left, especially since there
is ALWAYS an incentive to make the big bigger once you've made it at
all, just as there are ALWAYS people suggesting that their basket is
perfect for all the rest of the eggs and not just some of them.

So, is there a point to having some sort of forum for the care, feeding,
and education of Administrators who have to deal with University
Computing (in general) and HPC (specifically)?  Absolutely.
Communication and debate and information exchange are key components of
surfing and dynamical management -- with them it is actually possible to
semi-centralize while still remaining dynamic and free.  However, as
always there is the problem of having enough people around who have
the largest possible views of the terrain, so that their views overlap
and complement one another and give participants a real perspective, an
eagle's eye view rather than an ant's-eye view, of the changing and
complex landscape of HPC.

That's what's so damn good about this list.  There are real eagle-eye
views available for the asking on nearly any aspect of cluster
computing.  I may not agree with Mark on everything, but his view and
description of running a large compute cluster as if it were a small
personal cluster (only bigger) resonates strongly with me.  So do the
many descriptions of smaller clusters, of big superclusters, of "grids"
or specialized clusters.  Listening carefully, it becomes very clear
that cluster computing is about diversity, not homogeneity, and that the
"standard recipe" cluster is, like so many recipes, just a starter
recipe to get you going; the real reward comes from the herbs and
spices and little bit of love that go into that recipe as you change
it around and make it your own.  There is much wisdom, much experience.

However, it is NOT terribly accessible to big-A administrators, who may
view "solving the HPC resource allocation problem for the year" as ONE
thing on their calendar for THIS week and not want to have to completely
rethink it for a year or four thereafter.  It isn't that they don't want
to do a good job -- they just don't have the time or the knowledge to do
so on their own, and the easy, controllable, expensive and wrong
solution is always there, beckoning, often with at least one subordinate
who stands to benefit by being placed in charge of a new big centralized
resource cheering the whole thing on.

The question is, how does one fix this situation?  Most places that I
know of that "succeed" fix it the same way Mark does, the way Duke does
-- one or more good, knowledgeable people who have the eagle's-eye view
sit down in the most collegiate of traditions with Deans and other
decision makers and articulate, over and over again, how and why things
should be done thus and so.

Places that "fail" usually do so because they just plain don't have such
a champion and the administrators take the easy wrong road and "call
Microsoft".  At least metaphorically, although recently it sounds like a
real possibility.  Perhaps they refuse to support cluster computing at
all, perhaps they insist that cluster owners build, run, manage their
own clusters, perhaps they REQUIRE that all cluster computing WILL be in
the One True Cluster and I hope your code will run well on Intel Xeons
with GbE interconnects and nothing else running Microsoft's new cluster
product because that's what we've got and all you are permitted to buy
into, babeee...

So how to turn failures into successes, given that Hahns are a
relatively rare species?  How to ensure that more institutions succeed
first cluster around, get their entire cluster infrastructure projects
off of the ground effectively, avoid pitfalls and mistakes, even have
the maximal number of informed choices so that they can really
intelligently select infrastructure and support models that meet their
needs and their budget?

I don't know.  Really.  The beowulf list is a memetic breeding ground
where little Hahns all over the world CAN get their start from the memes
Mark so delicately scatters, so it has doubtless been instrumental in
most of the recorded successes in cluster support to date.  However, it
has not really closed the loop to eliminate the need for that critical
"cluster expert on the premises" that I think characterizes most real
success; it doesn't provide big-A-dministrators what they need to do the
right thing >>directly<<.  Perhaps a dedicated list or wiki or other
communications channel that reaches out to the right people WOULD do it
(or is doing it, as at least some doubtless exist already).  I'm just a
bit skeptical of this, though.

Big-A people are PEOPLE people, not list people or wiki people, by
virtue of a lifetime of career selection choices.  They are often
intelligent and very good at listening, learning rapidly, and making an
informed decision, but they usually need HUMANS to do the articulation
and don't (can't, really) use Google, the web, textbooks, or lists to
figure it all out for themselves...

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu