[Beowulf] Cell

Robert G. Brown rgb at phy.duke.edu
Thu Apr 28 07:27:10 PDT 2005

On Thu, 28 Apr 2005, Vincent Diepeveen wrote:

> >> "If you were plowing a field, which would you rather use? Two strong oxen
> >> or 1024 chickens?"
> >
> >I'd take the chickens any day, if each chicken could plow 1/1024th of
> >the field faster than the two oxen could do the whole thing.
> >
> >And they only cost chickenfeed...;-)
> The fundamental problem is you need a $2 million network for your 1024
> chickens.
> All i need with the virtual 1 tflop processor is some of that XDR memory,
> add a few memory banks at the mainboard which are each independantly
> connected to one of the 4 sides of the chip. 

Ah, Vincent, I begin to see your point.  We are (not surprisingly, I
suppose) talking about two different things.  I thought that your
remarks were directed at the discussion on VECTOR PROCESSORS.  Vector
processors are speciality processors designed to do one thing only, but
do it very well.  Take in one or more vectors of data on the right,
perform any one of several relatively simple/linear transformations on
each element (generally the same transformation or set of
transformations to every element) and spit out a vector of results on
the left.  Classic examples of vector processors are e.g. digital signal
processors or graphics processing units -- both take a fairly
predictable stream of vectorized data and wreak a fairly standard set of
transformations upon it.  See:


Because they don't have to waste all that processors space on the
implementation of general purpose instructions, they can stack up nice
little floating point pipelines much deeper than a general purpose CPU,
and run many M successive steps of a vector transformation at once, where
it is applying instruction 1 to the Mth data object at the same time it
is applying instruction M to the 1st data object in a stream of N
objects.  If N is much much bigger than M, and if the processor doesn't
have to do annoying things like compare each result to something and
maybe branch and flush and refill the entire pipeline, this can reduce
the time required per instruction to something like 1/Mth of the clock
(presuming that the logic can retire one instruction per clock cycle at
each level).

Note that general purpose CPUs often have pipelines these days and can
manage "vector instructions" but the pipelines are not very deep --
typically what, 2-3 instructions?  -- and often support only a few
speciality floating point operations.  This is enough to enable
relatively fast simple transforms that involve e.g. multiply-adds (a
very common instruction pair in linear algebra and one of the things
studied by stream).

Thus a perrennial question discussed on the list is whether or not
beowulfs equipped with attached DSP or GPUs to do vector instructions is
a "win" in terms of performance for people with the right code.  This
often appears in the context of a discussion of whether or not to build
a cluster out of the latest round of game boxes (e.g. gamecubes,
playstations, whatever).  This is because these boxes, these days, have
enough general purpose power to be able to run a simple operating system
including a network, but have dedicated graphics processors so that they
can do their game graphics transformations very quickly.  

There have been "clusters" with vector processors attached to scalar
processors sold as (usually SIMD) supercomputers in years past, e.g.
Thinking Machines CM series -- consensus among my friends who used them
are that they were a godawful pain to program but sure, quite speedy for
the right code.  I think that the accuracy of this description persists
to this very day for this sort of architecture.

Now, I was pointing out the following simple fact.  Memory speeds are
MUCH MUCH SLOWER than CPU speeds, and for that matter, pretty much
always have been.  Furthermore, faster memory (for any given memory
generation) costs MUCH more than slower memory.  That is WHY there is a
memory hierarchy on EVERY system ever built.  It runs something like the
following simplified picture with times that are best of my recollection
and may not be current (they very definitely change as systems evolve,
although more slowly than Moore's Law advances raw CPU speed, creating a
widening gap):

  register memory: on CPU -- fastest, access in ~1 clock as part of
execution cycle

  L1 cache: static RAM, on CPU -- very fast, access in a few clocks (<=1

  L2 cache: static RAM, also "on" most modern CPUs but hierarchically
lives between L1 and either L3 on the few systems that have a third
cache layer or more commonly DRAM -- main memory.  access in ~1-5 ns.

  DRAM: main dynamic memory on most computers.  A typical modern design
pays a relatively high latency hit for reading at a new address (say 40
ns) followed by much lower hits for streaming sequential reads beginning
at the first location -- IIRC around 10 ns each, but this changes fast
enough that I wouldn't be surprised if my numbers are obsolete.  How
much data gets clocked in per cycle depends on many things such as the
width of the datapath.

  Virtual memory:  swap, NUMA, non-NUMA network partitioned memory.
Typically BOTH high latency AND relatively low bandwidth, although NUMA
SMP designs can be RELATIVELY fast since that's the whole point.
Numbers can range from about an order of magnitude slower than DRAM to
many orders of magnitude slower, especially for latency-bounded access
to disk-based swap where disk latencies are typically in the MILLIsecond
range.  If you are running code that "needs" VM to fit, you're either
dead (it can be viewed as a code design "bug", as I think Mark pointed
out) or you have to work very hard organizing your code to get
"acceptable" speeds

Note that with the exception of register memory all memory hits on a
system SLOW the system relative to its theoretical instruction execution
rate.  Instructions are typically retired in 1/2-2 clock cycles
(depending on pipelining and parallelism) which is a sub-ns time on
modern CPUs.  Accessing memory outside of L1 takes many clocks, although
lots of things work to make those accesses run in parallel with
execution so that the CPU has to wait a minimal amount of time to get
data/code into its working memory.

The whole point of having the memory hierarchy is that if your program,
or data, have any significant degree of locality, having the cache hides
the slowness of the DRAM subsystem, which is a factor of 30-100 slower
than operational register memory on the CPU itself.  If your code is 90%
local at any given layer of cache, the average time required to run your
code is something like 0.9 T_s + 0.1 T_l where T_s is the short time to
get something out of (say) L_2 vs the long time T_l to get something out
of DRAM.  In ns, this could be something like 0.9 1 + 0.1 10 = 2 ns
average access time instead of 10 ns average access time to get the same
data straight out of DRAM without caching.

The effect is especially great when looking at random access patterns,
where one pays the 40 ns latency hit in DRAM.  benchmaster has a
"shuffled" mode that lets you look at random read, random write, random
read/write rates to big blocks of memory, and memory performance really
goes straight to hell compared to streaming for cache-defeating
patterns.  I do mean hell:

rgb at lucifer|B:1115>./benchmaster -t 6 -s 1000000 -r
avg full = 1.508151e+08 min = 1.502259e+08  max = 1.519108e+08
Content-Length: 1065

<?xml version="1.0"?>
  <version>Benchmaster 1.1.2</version>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">104832</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">99.144</nanotimer_granularity>
    <name>memory read/write</name>
    <args>-t 6 -s 1000000 -r</args>
    <description>Reads, then writes 4 byte integer vector</description>
    <time units="nsec">1.48e+02</time>
    <time_stddev units="nsec">3.97e-02</time_stddev>
    <min_time units="nsec">1.48e+02</min_time>
    <max_time units="nsec">1.49e+02</max_time>
    <rate units="10e+6">6.74e+00</rate>

Note the >>148 nsec<< access time required to read, then write, a single
integer into/out of a shuffled/random memory location in a vector much
larger than cache.  This is much LARGER than you'd naively expect just
from the properties of DDR on a 500 FSB system.  Compare it to doing the
SAME TASK (complete to the indirected addressing, but with an unshuffled
addressing vector):

rgb at lucifer|B:1116>./benchmaster -t 6 -s 1000000
avg full = 8.291473e+06 min = 8.127370e+06  max = 8.964347e+06
Content-Length: 1055

<?xml version="1.0"?>
  <version>Benchmaster 1.1.2</version>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">104832</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">99.920</nanotimer_granularity>
    <name>memory read/write</name>
    <args>-t 6 -s 1000000</args>
    <description>Reads, then writes 4 byte integer vector</description>
    <time units="nsec">5.81e+00</time>
    <time_stddev units="nsec">2.66e-02</time_stddev>
    <min_time units="nsec">5.65e+00</min_time>
    <max_time units="nsec">6.48e+00</max_time>
    <rate units="10e+6">1.72e+02</rate>

Down to 5 nsec per integer -- a more efficient addressing loop would
probably drop this a hair.  The memory is no faster, but now cache
"works" for a factor of 30 speedup.  Go down still further in size:

<?xml version="1.0"?>
  <version>Benchmaster 1.1.2</version>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">104832</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">99.900</nanotimer_granularity>
    <name>memory read/write</name>
    <args>-t 6 -s 1000</args>
    <description>Reads, then writes 4 byte integer vector</description>
    <time units="nsec">3.47e+00</time>
    <time_stddev units="nsec">7.70e-03</time_stddev>
    <min_time units="nsec">3.41e+00</min_time>
    <max_time units="nsec">3.62e+00</max_time>
    <rate units="10e+6">2.88e+02</rate>

into RW into L1, and you see times approaching 50x faster, where again
some of this is double-indexing overhead.  Even here random access
forces one to give back a factor of 10 in speed -- it is faster to
access a streaming int vector out of DRAM than it is to access data out
of L1 in random order.  The CPU really "wants" data access to be
sequential and local, and makes you pay for it in time and idle cycles
when it is not.

The net result of all of this is that a LOT of what you perceive of as
"raw CPU speed" on any given computer is an illusion -- smoke and
mirrors -- perpetrated by a very clever memory hierarchy and will vanish
instantly for the wrong kind of cache defeating work. There are all
sorts of lovely human metaphors -- some work you can do "in your head",
some requires information in papers on your desk that you have to put
into your head, some involves papers in filing cabinets in your office
(and indeed all your tasks START by getting out some particular file or
set of files), some of those files have to get shipped in from the main
office across town and all your results ultimately have to be shipped
back to the main office when you're done.  Same thing.  You can be a
super-genius and do the "brain work" for each file in zero time and only
be a little faster than the plodding Joe next door if it takes both of
you longer to get the files from across town into your local file
cabinets, get the files from there onto your desks, and get the data
from the files and into your respective heads than it does for either
one of you to solve the actual problem once it is there.

This is not always the case, of course, for HPC and computer tasks in
general any more than it is for humans.  Humans might WELL take on a
task and require days or weeks to solve it from their "desktop" compared
to minutes to hours to obtain the task and file away results.  Plenty of
HPC tasks involve things like computing trancendental functions or doing
division inside the primary loop, tasks that take a LONG time and give
the memory subsystems plenty of time to keep up with the CPU.

So one invents a term to differentiate task types:  "memory bound" tasks
are ones where memory I/O is the rate-determining bottleneck that limits
performance.  "CPU bound" tasks are ones whose completion time scales
with the inverse of CPU clock, more or less independent of the
particular memory hierarchy involved.  On parallel architectures,
"interprocessor communications (IPC) bound" is an extension of the
memory bound idea to distributed tasks; the IPC bottleneck has profound
effects on the parallel SCALING of tasks across a set of nodes.

>From what you are saying (in this and previous communications) your
particular application, computer chess, is in the relatively happy
regime where the application is CPU bound, hence your somewhat cavalier
attitude to the VERY DIFFICULT AND EXPENSIVE TASK of designing the
CPU-memory interface on modern computers.  I don't think you appreciate
how serious a problem this is, or how much serious money e.g. Intel or
AMD spend on solving it.

Whether or not you are memory bound can be determined easily enough from
the task scaling -- if it takes (say) 4 seconds to determine its next
move from some given configuration on a 500 MHz P6-family CPU, 2 seconds
on a 1 GHz P6-family CPU, and 1 second on a 2 GHz P6-family CPU
(independent of the amount or kind of memory, and probably even
independent of the amount of L2 cache) then yes, it is probably CPU
bound.  Or use minutes or hours if appropriate.

Guessing that a chess program spends most of its time doing character or
integer tasks and comparisons -- processing large integer-indexed
outcome trees of integer-sized binary encoded information with boolean
instructions and indexing arithmetic -- one would expect the ENTIRE
discussion on vector processors to be pretty much irrelevant to your
application, as the VPs in question only are useful for doing streaming
floating point arithmetic, and generally only for a fairly specific
subset of numerical instructions and fairly specific data sizes at that.
It isn't clear that they would be at ALL useful, or even usable, for/by
your program.

If, OTOH, the program perhaps STARTS by scaling with clock for
relatively simple board configurations but then grinds to a metaphorical
halt and no longer scales with CPU clock for more complex ones; it is
quite possible that it starts CPU bound (for configurations that fit
into L2 or are sufficiently local in memory that the architecture's
cache fill algorithms "work") and then crosses over as the program has
to look further and further away in memory (and in more and more random
ways) to work out the outcome trees.  In this case you'd observe a very
definite cache effect -- larger processors wouldn't necessarily run
instructions faster at any given clock but they WOULD run out of cache
more often and avoid that nasty random access latency hit and hence
would finish complex trees faster.  You'd likely also observe a fairly
straightforward dependence on memory type and speed, given that you're
running on systems as old as 500 MHz PIII's, with their relatively slow
FSB and relatively inefficient MMUs.  Faster memory would produce
immediate improvement independent of CPU speed, or on top of it as I
don't think you can get 500 MHz systems with faster memory but rather
have to get faster systems with faster memory.

It wouldn't be surprising for those faster systems to relatively
UNDERPERFORM CPU clock speed increases if your task IS partly memory
bound, because of the aforementioned memory speed gap.  Bleeding edge
CPU clock speeds are about a factor of 8 faster than a 500 MHz PIII, but
DRAM memory speeds have not gone up this much.

> Even if that XDR rambuss grossely expensive at say $1000 for 2 gigabyte, 
> then i put in 4 dimms of it at each independant memory controller of the
> chip, 
> delivering a total bandwidth of 100 gigabyte a second.
> Even a medium engineer can add on die memory controllers to a chip.
> The extreme expensive mainboard i buy for $2500 and i feel ripped off.
> The extreme expensive cell processor i buy for $3000 and i feel ripped off,
> But even then at a total price of say $10k the system is eating your $3
> million 1024 chicken system alive.
> See my point?
> Whatever price the 1024 chickens have, your network costs several million.

No, no, no.  First, the "oxen" you are proposing as an alternative
simply doesn't exist -- we'd ALL love to have cheap 1 THz processors,
and in lessee (log_2(25) = 4.64)x 2 = 9.3 years we might have them, if
Moore's law holds out that long.  If they did exist, we'd yoke them
together and call them chickens and build QHz cluster out of them so we
could finally count the number of angels that can dance on the head of a
pin (the answer is doubtless 42, but we have to be sure:-).

Second, many a 1024 chicken cluster is very productive indeed with a
100BT network that costs maybe $10/chic-- I mean node;-) Other clusters
can get by with gigabit ethernet at perhaps $100/node.  Then sure, there
are high end clusters that spend as much or more on the network per node
as they do on the nodes themselves, and intermediate RDMA and TCP/IP
managing gigE NICs that come in somewhere in between.

All of these options make SENSE and are COST-BENEFIT OPTIMAL for various
classes of problems -- the 100 BT solution for all embarrassingly
parallel problems and quite a few "coarse grained" parallel problems
(e.g. rendering, computing nifty mandelbrot set pictures, maybe even
chess), the successively faster network solutions for parallel problems
with IPCs that are increasingly bandwith and/or latency bound.  The idea
is to pick a cluster architecture for which your problem is UNbound by
IPCs out to some desireable degree of speedup, in the most affordable
way, not to build the world's biggest cluster as a marketing ploy
(however often the latter is what actually happens).

It is worth pointing out NOW that it is well known that there are whole
classes of problem that do NOT parallelize efficiently on any available
hardware including even the most advanced networks.  A cluster or
supercomputer of ANY sort do not always make sense and there is a whole
computer science of the scaling of the diminishing returns.  Network I/O
is still hovering around 1 microsecond latency and bandwidths in the
range of billions of bits per second.  The former is order(s) of
magnitude slower than the slowest DRAM; the latter is order(s) of
magnitude slower than the slowest DRAM.  You cannot just refill
registers on your CPU from a memory location on a remote system anywhere
nearly as fast as you can refill it from your own DRAM, and in many
cases you have to FIRST move memory from DRAM to DRAM and THEN let
operations move it into L2, L1, registers and so on, so it is strictly
an ADDITIONAL hit timewise.  For single-threaded tasks performing
serialized instructions, there is no advantage to using a cluster unless
the task is so big that it won't fit into local memory, in which case a
network-cluster-based VM extension may well be the fastest possible
solution that lets you run at all (whether or not it is "fast enough" to
do what you need to do).

NOTE WELL!  Speedup, slowdown, and what kind of network suffices to
represent a "solution" to any given problem depends STRONGLY on task
organization.  You cannot just say "my task doesn't run any faster on a
cluster".  You may have to COMPLETELY REORGANIZE AND REWRITE your task
so it WILL run faster on a cluster, and the reorganization and cluster
design it will run on are INTEGRATED in the final solution so one of
the former will by no means fit all of the latter.

To make the discussion concrete, let's address two tasks.  We just heard
from somebody that discovered the simplest possible form of parallelism
-- instead of trying to partition an entire computation and all of its
linear algebra components to run efficiently on any kind of cluster (and
most modern "real supercomputers" are just clusters with custom
memory/cpu interconnects -- an internal "memory" if you like) he just
ran multiple instances of the serialized job.  Why not?  He had lots of
runs to make, and doing them at the same time on N nodes gets them done
in 1/N of the time of a single node!  This is the IDEAL!  The more
"like" this you can make your task, or any significant PART of your
task, the better your parallel scaling will be and the greater the
parallel speedup you will observe.

Then lets think about YOUR task, parsing out chess moves.  So far, it
sounds like the code that does it is pretty much strictly serialized.
Strictly parallelizing the serialized portions within any given tree may
not be particularly efficient, as it may take longer to communicate
state at each succeeding level than the task would take to run if it
didn't pause to split off and communicate each tree node to a real node.

What you'd like to do is find a task reorganization that makes the
entire task partition, so "parts" of the computation can be advanced
independently on completely different systems after being given
relatively little starting information.  In the case of chess, I'd guess
that one trivial parallelization is to split up all the pieces that CAN
move in any given configuration so that each node solves the tree of
outcomes for just moving that one piece.  Prune the results, and when
any given piece dead-ends (and its node becomes free again) split off
single-moove components of the productive trees recursively.  Arrange it
(if possible) so that communications are rare per node compared to node
computation and you'll have a winner in terms of parallel speedup.

This presumes that it is possible to make local decisions about the
quality of positions that are sufficient to accept or reject them with
relatively small IPCs (communicating the general range of "best position
so far" across the cluster).  I'm guessing that MOST possible initial
moves at any given level lead to positions that are "obviously bad" and
one only needs to pursue them far enough to overcome possible
optimization humps that separate these "bad" positions from a really
really good one accessible in only this way.  Which tempts me to digress
upon GA's, simulated annealing, and discrete or continuous optimization
theory in general, but I will manfully resist...:-)

Or something -- I haven't studied the problem and you have, but IF you
can find a partitioning that REMAINS CPU bound or only weakly memory
bound on a cluster and involves only relatively "small" IPCs in order to
set up and manage the partitioning, then a cluster becomes a good way to
get things to speed up a lot for only a little money.  The more data you
have to exchange in the partitioning, the more you have to spend
(relatively speaking) on network compared to CPU, and the more "IPC
bound" the parallel task scaling will be.

> A 12288 chicken system from IBM costs to be precise 8 million euro,
> delivering 27.5 tflop. Thanks to the network it is not a chicken system but
> europese fastest supercomputer.

Sure, sure, sure.  But THIS list is as much about a 2.5 K$ 8-node
cluster as it is about an 8 ME 12288-node cluster.  "Europe's Fastest
Supercomputer" with all those nodes and that really expensive network
isn't worth a damn to somebody running a serialized program, and it may
not be worth much to somebody running a fine grained synchronous
parallel program whose parallel scaling peaks at 32 nodes.  It will be
MOST useful to people who run embarrassingly parallel programs, and the
main point that I can see of having such an agressively large size is
that it will let lots of people run much smaller programs at the same
time, or let somebody with a 32-node task can run 384 instances of it at
once, in an embarrassingly parallel way.  And of course it will get a
really high linpack score and hence end up near the top of the Top 500
list, which is good marketing and almost certainly the REAL point of the
whole thing given the paucity of problems that might actually use the
entire resource constructively all at once.

> By the way, my chessprogram doesn't output 2 bytes, but it outputs 4 bytes
> each few minutes. 
> A chess move.
> If you have a processor that can run it, just outputting 4 bytes each few
> minutes, at a speed equal to a 1 Thz opteron processor, then i can sell
> 100000 machines of that hands down if its price is that of a normal PC.
> At a 1 thz machine it will output d2d4 anyway the first move. In
> correspondence notation 'd2' is the field 42 and 'd4' is 44.
> So it's output is 42 anyway.

Right, but what if you could partition the "next move" problem so that
it ran on a $2500 8 processor cluster where each processor provided you
with (say) 3 GHz of processing power today (instead of an entirely
mythical 1 THz of single processor power).  That gives you 24 GHz of
processing power for about what a high end gaming system costs from
Dell, and about 0.1 THz (100 GHz) for only $10K and 32 nodes.  Even if
you end up doubling the per-node cost and get better/faster nodes or a
better network after you finish analyzing the problem decomposition,
it is still pretty cheap FOR WHAT YOU WANT TO ACCOMPLISH which is a
nontrivial computational task.

My final advice is to think carefully about problem decomposition and
recursion.  AFAICT, chess in general has a natural decomposition and
recursion structure, where although the space of possible moves grows
superexponentially, it also prunes almost as fast so that slow old human
brains reject most possible move sequences almost before they start to
really work.  If you can figure out any way at all to do "the same
thing" on a cluster, to carry out the pruning based on local or
almost-local information, you should be able to realize linear or
near-linear speedup on very pedestrian cluster designs, ones with CHEAP
networks and CHEAP nodes.  It seems like a fun problem, even.


> Vincent
> >   rgb
> >
> >>   Seymour Cray
> >> 
> >> Vincent
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
> >> 
> >
> >-- 
> >Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> >Duke University Dept. of Physics, Box 90305
> >Durham, N.C. 27708-0305
> >Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> >
> >
> >
> >

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

More information about the Beowulf mailing list