[Beowulf] Re: vectors vs. loops

Tue May 3 12:11:18 PDT 2005

On Tue, 3 May 2005, Philippe Blaise wrote:

> Robert G. Brown wrote:
> 
> >....
> >
> >Still, the marketplace speaks for itself.  It doesn't argue, and isn't
> >legendary, it just is.
>
> But, does the hpc marketplace have a direction ?

Of course it does.  At the moment I would say that the direction isn't
obvious only because the last decade plus has been so
overwhelmingly directed towards COTS clustering and away from big iron
supercomputers.  Clustering has been so fantastically successful that
clusters have nearly saturated the available HPC marketplace, enabled
new growth in completely new directions (such as bioinformatics and
rendering), made inroads into completely new non-scientific disciplines
(e.g. economics), started to appear in medical applications (not just
research, applications), created whole new genres of games, been the
basis for the most persistent and successful part of the internet
revolution, and STILL continue to evolve and grow and create new
markets and applications.

In the HPC market, any apparent lack of direction is caused by the
overwhelming satisfaction and degree of cost-benefit optimality in place
in all of the clusters in operation in scientific labs and science
departments all over the planet, but especially in places with modest
resources or a lack of easy access to expensive centralized
"supercomputing" centers.  There are clusters in operation at small
technical colleges in India where buying ANYTHING to support science and
the technical development of students is an agonizing process due to
pure and simple lack of resources.  There are clusters in operation at
huge computer centers in the United States where they buy hundreds to
thousands of nodes a year.  The war is over.  Clusters won, and will
continue to win until a better (more cost-beneficial) paradigm comes
along.  Even then, the cluster is a "super paradigm": people will likely
take systems based on the better paradigm and build a cluster out of
them.

During this entire period vector systems have never gone completely out
of favor, simply because they ARE suitable for a certain class of
problems.  Some problems in that class are considered "valuable" to both
businesses and the scientific community -- valuable enough to justify
spending a relatively large amount of money on vector computers in order
to accomplish the work faster.  Even here, though, the obvious and
often readily accessible power of parallelism has pushed vector
systems away from being "the" supercomputing model, in the form of
standalone units with relatively few processors, and towards being
CLUSTERS of processors (with whatever network/memory
model/interconnect).

Nearly anybody doing useful work at all can do more useful work with
several systems working in parallel due to scaling and physical limits
that will ALWAYS limit how much work one can do with a single
"computer".  And I'm not talking about the problem of "parallelizing a
task" with IPCs and everything -- I'm just talking about using a
cluster to run tasks in embarrassingly parallel mode, which is how
nearly anybody can get approximately linear speedup relative to running
on a single system.
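
The "approximately linear" claim is just arithmetic, and a minimal
Python sketch makes it concrete.  It assumes N fully independent jobs of
equal length and zero communication or scheduling overhead; the function
names ep_wall_time and ep_speedup are made up for illustration.

```python
import math


def ep_wall_time(n_jobs, n_nodes, t_job=1.0):
    """Wall-clock time to run n_jobs independent jobs of length t_job
    on n_nodes identical nodes (assumed: no communication, no overhead)."""
    rounds = math.ceil(n_jobs / n_nodes)  # jobs run in waves of n_nodes
    return rounds * t_job


def ep_speedup(n_jobs, n_nodes, t_job=1.0):
    """Speedup relative to running every job back to back on one node."""
    serial = ep_wall_time(n_jobs, 1, t_job)
    parallel = ep_wall_time(n_jobs, n_nodes, t_job)
    return serial / parallel


if __name__ == "__main__":
    for nodes in (1, 4, 16, 64):
        print(nodes, ep_speedup(n_jobs=1024, n_nodes=nodes))
```

With 1024 jobs the speedup tracks the node count exactly (16 nodes give
16x).  It only falls short when the job count does not divide evenly:
100 jobs on 64 nodes need two waves and give 50x.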

> Few years ago, some people had a "fantastic vision" to replace the 
> vector machines market :
> use big clusters of SMPs with the help of the new paradigm of hybrid 
> mpi/openmp programming.
> Then the main vendors (usa), except Cray, were very happy to sell giant 
> clusters of smp machines.
> 
> Nevertheless, the japanese guys built the "earth simulator" ; which is 
> still the most powerful machine in the world
> (don't trust this stupid top500 list).
> 
> Then Cray came back ... with vector machines...
> 
> Don't underestimate the power of vector machines.
> Yes Fujitsu or NEC vector machines are still very efficient, even with 
> non contiguous memory access (!!).
> 
> One year ago, the only cpus that sometimes were able to equal vectorial 
> cpus were alpha (ev7) and itanum2 with
> big caches and / or fast memory access. Remember that alpha is dead. 
> Have a look to the itanium2 market shares.
> 
> The marketplace is not a good argument at all.

The "marketplace" is the ONLY argument that matters.  The economics of
giant "big iron" vector machines is THE dominant force that underlies
all current efforts associated with them.  Fujitsu, NEC, Cray, SGI, IBM,
maybe even Sun -- the surviving big iron companies don't survive on
low-margin sales, and never make a massive investment without expecting
a profit (high margin profit) at the end of it. They believe that they
can spend a fortune designing a system that will never sell to more than
a few dozen companies or operations (at millions of dollars apiece) and
still make money.  Lots of money.  As long as they are correct, this
sort of design will persist.  How correct they are depends in part on
how well they market BOTH the machines AND the "importance" to society
of the relatively few problems that they are supposed to solve
better/cheaper/faster than the alternatives permit.  I suspect that in
several cases the companies are de facto subsidized by various
governments or parts of governments interested in supporting the
continuing development of powerful vector systems for reasons of their
own (military, industrial, economic), even where the application space
on the face of things might not really justify it.

Remember a point that has been repeatedly made on this list -- the
further you are from "commodity" space (measured in mass market units
sold and the number of vendors that support it) the more costly things
get on the consumer side of things.  The economics of system design and
marketing is highly nonlinear, highly competitive, and in places
survives on razor-thin margins.  Note that the Alpha did NOT save DEC or
provide much benefit to DEC's successive merger-inhalers.  Itanium has
at best been a break-even proposition for Intel -- when I've talked to
Intel people directly they tend to hem and haw a bit about its future
(corporate party line aside), and I suspect that they actually LOST
money on Itanium, which isn't really all that surprising considering how
expensive it was to build and how little excitement it generated, at
least in HPC.

The point being that the market has had opportunities time and again to
reward DEC or Intel for building fast memory, big cache processors.
Instead the market (and I'm still talking HPC market, not even the
general market) chose overwhelmingly to purchase really cheap but "fast
enough" Durons, Celerons, PIIIs, P4s, PPros.  Even big cache mainstream
processors like the Xeons have suffered and lost market share when any
significant degree of price premium has been associated with their
implementation.  The HPC market has >>punished<< high performance
departures from the COTS mainstream more often than not for close to a
decade now.

So if really expensive high performers are failing, who is succeeding?
AMD, with the Opteron, which combines 64 bits, superior floating point
performance, and a LOW PRICE (relatively speaking) -- one where the
cost-benefit advantage relative to anything else available is
slap-in-the-face obvious for most HPC-like tasks I've tested myself or
heard of being tested.

The processor and systems design wars that matter (to "most" -- that
funny word again -- HPC cycle consumers) aren't being fought in NEC's
design chambers; they are being fought between Intel and AMD and involve
multicore processors, memory crossbars, heat dissipation, and just how
much valuable chip real estate to devote to floating point vs integer vs
interrupt handling vs cache vs memory integration to make the biggest
market segment as happy as possible and provide the best possible
BALANCE of performance.  To Intel, HPC is "important" but far from
critical.  To AMD, I think that HPC is maybe critical -- they've
successfully defined themselves as a premier HPC platform and clearly
invest more resources in floating point.  Intel still rules the desktop,
though, where floating point performance isn't terribly important, even
if AMD continues to fight gamely there.

The thing that continues to speak against vector processors is the fact
that aside from task-specific ASICs (DSP and GPU) the "mass market"
(including HPC) has yet to support an OTC vector unit that integrates in
any OTS system design.  Why this is so I'm not certain -- very likely
the memory speed requirements are so much beyond what inexpensive OTS
DRAM can support that including them is pointless for most applications
-- but whatever the nominal reason, since it is clearly POSSIBLE to do
so, the REAL reason it doesn't happen is that nobody (thinks they)
would make any money from it if they did it.  It's all about economics
and the marketplace, nothing else.  It would cost more than the mass
market is willing to pay, and probably more than the more cost-tolerant
HPC market is willing to pay in order for the implementer to make enough
money to justify the effort and risk.

> Vectorization and parallelization are compatible

Absolutely.

> Hybrid mpi/openmp programming  is a harder task than mpi/vector programming.
> If you have enough money and if your program is vectorizable, buy a 
> vector machine of course.

I'm not sure what the former means.  I honestly think that MOST clusters
on the planet don't run "real" parallel code, but rather run N instances
of a single threaded application in an embarrassingly parallel way,
probably at a ratio of 60:40 or even more.  Writing real parallel code
with nontrivial message passing and barriers IS a fairly difficult task,
although MPI enables it to be done reasonably portably (where the
algorithmic and parametric tuning per cluster still remains as a serious
portability issue for many applications).  Vector programming I view as
a primarily single-thread issue independent of parallelization.  From
the parallel programmer's point of view I would think that it only
shifts the compute-to-communicate balance, usually in a way that favors
less parallelism in communication-bound code (by decreasing the compute
time spent per work chunk per IPC in a given task partitioning).  For EP
tasks, of course, it doesn't matter, and the faster the better.
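
One way to see that shift is a toy cost model, sketched below in Python.
Everything in it is an assumption for illustration: time per step is
taken to be W/(v*P) + c*P, where W is the work, P the node count, v the
speedup from vectorizing the compute part, and c a communication cost
that grows linearly with P.  Real codes have other communication
scalings, and the names step_time and best_node_count are hypothetical.

```python
import math


def step_time(work, n_nodes, vector_gain=1.0, comm_per_node=1.0):
    """Toy time per step under the assumed model W/(v*P) + c*P."""
    compute = work / (vector_gain * n_nodes)   # shrinks with P and v
    communicate = comm_per_node * n_nodes      # assumed to grow with P
    return compute + communicate


def best_node_count(work, vector_gain=1.0, comm_per_node=1.0):
    """Node count that minimizes step_time in this toy model.

    Setting d/dP [W/(v*P) + c*P] = 0 gives P = sqrt(W / (v*c)).
    """
    return math.sqrt(work / (vector_gain * comm_per_node))


if __name__ == "__main__":
    for v in (1.0, 4.0):
        p = best_node_count(10000.0, vector_gain=v)
        print(v, p, step_time(10000.0, p, vector_gain=v))
```

In this model, with W = 10000 and c = 1, the best node count drops from
100 to 50 when v goes from 1 to 4, while the time per step at that
optimum drops from 200 to 100.  Faster compute per chunk means
communication dominates sooner, so the sweet spot moves to fewer nodes.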

> Cluster of SMPs ? they will remain an efficient and low cost solution, 
> (and quite easy to be sold
> by a mass vendor).
> And thanks to cluster of SMPs with the help of linux, the HPC market is 
> now "democratic".
> 
> Of course, it would be nice to have a true vector unit on a P4 or Opteron.
> But the problem will be the memory access again.

There we are in complete agreement, on both counts.  Still a
possibility, though, in future designs -- future memory designs in
particular look like they'll be much more "democratic" about providing
independent (non-CPU mediated) access to the system memory to things
like coprocessors.  This leaves open as possibilities intriguing models
of single system parallelism where sub-tasks are run on an attached
processor while the main CPU does its general purpose thing.

   rgb

> 
> Bye,
> 
>   Phil.
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu