[Beowulf] Cell

Robert G. Brown rgb at phy.duke.edu
Wed Apr 27 14:41:56 PDT 2005


> If you can deliver 1 processor that can do 1 tflop, there is no need for
> bandwidth anymore, everything happens on that chip in such a case :)

???

Perhaps I'm highly confused, but I thought vector units operated on
DATA, and that that data had to live in MEMORY.

Memory, in turn, has some fairly stringent limits on the rate at which
it can be accessed -- very definitely finite in terms of both latency
and bandwidth.  Furthermore, for streaming vector-like operations in at
least NORMAL PC-like architectures (as opposed to supercomputer
architectures, where a Lot Of Money (LOM) is spent feeding the
processor), I would expect memory bandwidth to be the fundamental
bottleneck in streaming/vector operations.
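
To put a rough number on that -- a back-of-the-envelope sketch (mine,
assuming 8-byte doubles and a triad-like kernel doing 2 flops per 24
bytes of memory traffic), not a measurement of anything:

/* Back-of-the-envelope sketch: memory bandwidth needed to keep a
 * hypothetical 1 TFLOP vector unit busy on a triad-like kernel.
 * Assumes 8-byte doubles; 2 flops (add + multiply) per element and
 * 3 doubles (read a[i], read b[i], write d[i]) = 24 bytes moved. */
#include <stdio.h>

int main(void)
{
    double flops_per_sec = 1.0e12;      /* hypothetical 1 TFLOP/sec    */
    double flops_per_el  = 2.0;         /* one add + one multiply      */
    double bytes_per_el  = 3.0 * 8.0;   /* a[i], b[i], d[i] as doubles */

    double bytes_per_sec = flops_per_sec / flops_per_el * bytes_per_el;
    printf("required bandwidth: %.0f TB/sec\n", bytes_per_sec / 1.0e12);
    return 0;
}

That works out to 12 TB/sec of sustained memory bandwidth just to keep
the hypothetical flops fed -- orders of magnitude beyond what a PC
memory bus of this era delivers.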

To be more explicit -- if you run stream on most reasonably modern
systems, stream "copy" takes a time that is very comparable to stream
"scale".  For example, on a P4 I have handy:

rgb at lucifer|B:1100>./benchmaster -t 2 -s 1000000 -i 1 -n 10
avg full = 1.497222e+07 min = 1.492613e+07  max = 1.513106e+07
Content-Length: 1067

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105536</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.158</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream copy</name>
    <command>./benchmaster</command>
    <args>-t 2 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = a[i] (standard is -s 1000000 -i 1 -n
10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.50e+01</time>
    <time_stddev units="nsec">1.80e-02</time_stddev>
    <min_time units="nsec">1.49e+01</min_time>
    <max_time units="nsec">1.51e+01</max_time>
    <rate units="10e+6">1.07e+03</rate>
  </benchmark>
</benchml>


rgb at lucifer|B:1104>./benchmaster -t 3 -s 1000000 -i 1 -n 10
avg full = 1.489518e+07 min = 1.486376e+07  max = 1.494210e+07
Content-Length: 1075

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105536</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.118</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream scale</name>
    <command>./benchmaster</command>
    <args>-t 3 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = xtest*d[i] (standard is -s 1000000 -i 1 -n
10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.49e+01</time>
    <time_stddev units="nsec">8.47e-03</time_stddev>
    <min_time units="nsec">1.49e+01</min_time>
    <max_time units="nsec">1.49e+01</max_time>
    <rate units="10e+6">1.07e+03</rate>
  </benchmark>
</benchml>

Note that these times are essentially identical.  The "scale" operation is
done in parallel with I/O and takes "zero time".  Clearly the
memory-to-CPU I/O is the bottleneck, not the processor, which is
"infinitely fast" as far as this operation is concerned.

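For reference, the kernels being timed here are just the standard
STREAM-style loops (a sketch of the operations, not benchmaster's
actual source):

/* Sketch of the "copy" and "scale" kernels (not benchmaster's actual
 * source).  Both loops move exactly the same amount of data -- one
 * double read and one double write per element -- so the multiply in
 * "scale" can hide completely behind the memory traffic. */
#include <stddef.h>

void stream_copy(double *d, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = a[i];
}

void stream_scale(double *d, double xtest, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = xtest * d[i];
}
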
Comparing triad:



rgb at lucifer|B:1099>./benchmaster -t 5 -s 1000000 -i 1 -n 10
avg full = 1.982203e+07 min = 1.980236e+07  max = 1.985822e+07
Content-Length: 1081

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105144</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.244</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream triad</name>
    <command>./benchmaster</command>
    <args>-t 5 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = a[i] + xtest*b[i] (standard is -s 1000000 -i 1
-n 10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.98e+01</time>
    <time_stddev units="nsec">6.08e-03</time_stddev>
    <min_time units="nsec">1.98e+01</min_time>
    <max_time units="nsec">1.99e+01</max_time>
    <rate units="10e+6">1.21e+03</rate>
  </benchmark>
</benchml>

shows a slowdown of roughly 1/3 relative to copy (about 2 ns per loop
pass instead of 1.5 ns).  Again, on-chip parallelism and the efficiency
of the memory interface are clearly exhibited.
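
The triad kernel itself looks like this (again a sketch, not
benchmaster's source) -- it touches three arrays per pass instead of
two, so some extra time at the memory interface is exactly what you'd
expect:

#include <stddef.h>

/* Sketch of the "triad" kernel (not benchmaster's actual source).
 * Each pass reads a[i] and b[i] and writes d[i] -- three doubles of
 * memory traffic per element versus copy's two -- while the add and
 * the multiply still overlap with the memory accesses. */
void stream_triad(double *d, const double *a, const double *b,
                  double xtest, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = a[i] + xtest * b[i];
}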

In other words, if you stuck a "1 TFLOP" vector processor into this
particular MEMORY architecture and ran stream, I wouldn't expect to see
much speedup.  The bulk of the time required for the actual numerical
operations is already hidden by the memory bottleneck and design
parallelism.  It isn't clear how much a larger cache would help, either,
on streaming code that doesn't reuse any memory location -- there comes
a point of diminishing returns where you're stuck at fundamental
bandwidth again, waiting for the cache to refill, even when the CPU is
much faster.  I tend to think of large caches as being useful to people
with lots of contexts to juggle more than as a necessary advantage to
people running LARGE vector operations, although they are an obvious
advantage to folks whose vectors can fit into cache:-)
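
A quick sanity check with the sizes reported in the hostinfo block above
(my arithmetic, not benchmaster output) shows why cache can't rescue a
streaming run of this size:

#include <stdio.h>

/* Quick sanity check using the sizes from the run above: a
 * 1,000,000-element array of 8-byte doubles is ~8 MB, versus
 * lucifer's 512 KB L2, so essentially every element has to come
 * in from main memory on every pass. */
int main(void)
{
    double array_kb = 1000000.0 * 8.0 / 1024.0;  /* one stream vector */
    double l2_kb    = 512.0;                     /* from hostinfo     */

    printf("array = %.0f KB, L2 cache = %.0f KB (%.0fx too small)\n",
           array_kb, l2_kb, array_kb / l2_kb);
    return 0;
}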

> "If you were plowing a field, which would you rather use? Two strong oxen
> or 1024 chickens?"

I'd take the chickens any day, if each chicken could plow 1/1024th of
the field faster than the two oxen could do the whole thing.

And they only cost chickenfeed...;-)

   rgb

>   Seymour Cray
> 
> Vincent

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
