[Beowulf] Cell

Vincent Diepeveen diep at xs4all.nl
Wed Apr 27 15:02:14 PDT 2005



At 05:41 PM 4/27/2005 -0400, Robert G. Brown wrote:
>> If you can deliver 1 processor that can do 1 tflop, there is no need for
>> bandwidth anymore, everything happens on that chip in such a case :)
>
>???
>
>Perhaps I'm highly confused, but I thought vector units operated on
>DATA, and that that data had to live in MEMORY.
>
>Memory, in turn, has some fairly stringent limits on the rate at which
>it can be accessed -- very definitely finite in terms of both latency
>and bandwidth.  Furthermore, for streaming vector-like operations in at
>least NORMAL PC-like architectures (as opposed to supercomputer
>architectures where a Lot Of Money (LOM) is spent feeding the processor)
>I would expect memory bandwidth to be the fundamental bottleneck in
>streaming/vector operations.
>
>To be more explicit -- if you run stream on most reasonably modern
>systems, stream "copy" takes a time that is very comparable to stream
>"scale".  For example, on a P4 I have handy:
>
>rgb at lucifer|B:1100>./benchmaster -t 2 -s 1000000 -i 1 -n 10
>avg full = 1.497222e+07 min = 1.492613e+07  max = 1.513106e+07
>Content-Length: 1067
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105536</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.158</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream copy</name>
>    <command>./benchmaster</command>
>    <args>-t 2 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = a[i] (standard is -s 1000000 -i 1 -n
>10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.50e+01</time>
>    <time_stddev units="nsec">1.80e-02</time_stddev>
>    <min_time units="nsec">1.49e+01</min_time>
>    <max_time units="nsec">1.51e+01</max_time>
>    <rate units="10e+6">1.07e+03</rate>
>  </benchmark>
></benchml>
>
>
>rgb at lucifer|B:1104>./benchmaster -t 3 -s 1000000 -i 1 -n 10
>avg full = 1.489518e+07 min = 1.486376e+07  max = 1.494210e+07
>Content-Length: 1075
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105536</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.118</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream scale</name>
>    <command>./benchmaster</command>
>    <args>-t 3 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = xtest*d[i] (standard is -s 1000000 -i 1 -n
>10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.49e+01</time>
>    <time_stddev units="nsec">8.47e-03</time_stddev>
>    <min_time units="nsec">1.49e+01</min_time>
>    <max_time units="nsec">1.49e+01</max_time>
>    <rate units="10e+6">1.07e+03</rate>
>  </benchmark>
></benchml>
>
>Note that these times are already identical.  The "scale" operation is
>done in parallel with I/O and takes "zero time".  Clearly the
>memory-to-CPU I/O is the bottleneck, not the processor, which is
>"infinitely fast" as far as this operation is concerned.
>
>Comparing triad:
>
>
>
>rgb at lucifer|B:1099>./benchmaster -t 5 -s 1000000 -i 1 -n 10
>avg full = 1.982203e+07 min = 1.980236e+07  max = 1.985822e+07
>Content-Length: 1081
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105144</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.244</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream triad</name>
>    <command>./benchmaster</command>
>    <args>-t 5 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = a[i] + xtest*b[i] (standard is -s 1000000 -i 1
>-n 10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.98e+01</time>
>    <time_stddev units="nsec">6.08e-03</time_stddev>
>    <min_time units="nsec">1.98e+01</min_time>
>    <max_time units="nsec">1.99e+01</max_time>
>    <rate units="10e+6">1.21e+03</rate>
>  </benchmark>
></benchml>
>
>shows a slowdown of roughly 1/3 relative to copy (2 ns per loop pass
>instead of 1.5 ns per loop pass). Again, on-chip parallelism and the
>efficiency of the memory interface are clearly exhibited.
>
>In other words, if you stuck a "1 TFLOP" vector processor into this
>particular MEMORY architecture and ran stream, I wouldn't expect to see
>much speedup.  The bulk of the time required for the actual numerical
>operations is already hidden by the memory bottleneck and design
>parallelism.  It isn't clear how much a larger cache would help, either,
>on streaming code that doesn't reuse any memory location -- there comes
>a point of diminishing returns where you're stuck at fundamental
>bandwidth again, waiting for the cache to refill, when the CPU is much
>faster.  I tend to think of large caches as being useful to people with
>lots of contexts to juggle more than a necessary advantage to people
>running LARGE vector operations, although they are an obvious advantage
>to folks whose vectors can fit into cache:-)
>
>> "If you were plowing a field, which would you rather use? Two strong oxen
>> or 1024 chickens?"
>
>I'd take the chickens any day, if each chicken could plow 1/1024th of
>the field faster than the two oxen could do the whole thing.
>
>And they only cost chickenfeed...;-)
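
For reference, the three loops being timed above are, going by the descriptions
in the benchml output, roughly the following. The function names and the
surrounding harness are my own sketch, not benchmaster's actual code:

#include <stddef.h>

/* d[i] = a[i]: one input stream, one output stream per pass. */
void stream_copy(double *d, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = a[i];
}

/* d[i] = xtest*d[i]: same memory traffic as copy plus one multiply,
   which is why it times identically -- the arithmetic hides behind
   the memory accesses. */
void stream_scale(double *d, double xtest, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = xtest * d[i];
}

/* d[i] = a[i] + xtest*b[i]: a third array is touched, which lines up
   with the extra time per loop pass measured above. */
void stream_triad(double *d, const double *a, const double *b,
                  double xtest, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = a[i] + xtest * b[i];
}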

The fundamental problem is that you need a $2 million network for your 1024
chickens.

All I need with the hypothetical 1 TFLOP processor is some of that XDR memory:
add a few memory banks to the mainboard, each independently connected to one
of the four sides of the chip.

Even if that XDR Rambus is grossly expensive, at say $1000 for 2 gigabytes,
I put in 4 DIMMs of it at each independent memory controller of the chip,
delivering a total bandwidth of 100 gigabytes per second.
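
(Rough arithmetic behind that 100 GB/s, assuming something like 6.4 GB/s per
XDR DIMM, which is my assumption rather than a spec I am quoting: 4 on-die
memory controllers x 4 DIMMs each x ~6.4 GB/s comes to roughly 100 GB/s
aggregate.)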

Even an average engineer can add on-die memory controllers to a chip.

The extremely expensive mainboard I buy for $2500, and I feel ripped off.
The extremely expensive Cell processor I buy for $3000, and I feel ripped off.

But even then, at a total price of say $10k, the system eats your $3 million
1024-chicken system alive.

See my point?

Whatever the 1024 chickens cost, your network costs several million.

A 12,288-chicken system from IBM costs, to be precise, 8 million euros and
delivers 27.5 TFLOPS. Thanks to the network it is not a chicken system but
Europe's fastest supercomputer.
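
(Worked out per TFLOP, that is 8 million euros / 27.5 TFLOPS, or roughly
290,000 euros per TFLOP, against the ~$10k sketched above for the single
1 TFLOP chip.)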

By the way, my chess program doesn't output 2 bytes; it outputs 4 bytes every
few minutes.

A chess move.

If you have a processor that can run it, just outputting those 4 bytes every
few minutes at a speed equal to a 1 THz Opteron, then I can sell 100,000 such
machines hands down if the price is that of a normal PC.

On a 1 THz machine it will output d2d4 as the first move anyway. In
correspondence notation, 'd2' is square 42 and 'd4' is square 44.

So its output is 42 anyway.
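
For the curious, a tiny sketch of that correspondence (numeric) encoding. The
function name is invented for illustration; the mapping is just file a..h ->
1..8 followed by rank 1..8 -> 1..8:

#include <stdio.h>

/* "d2" -> 42, "d4" -> 44; the move d2-d4 goes out as the digits 4244. */
static int square_code(const char *sq)
{
    int file = sq[0] - 'a' + 1;   /* 'd' -> 4 */
    int rank = sq[1] - '0';       /* '2' -> 2 */
    return 10 * file + rank;
}

int main(void)
{
    printf("%d%d\n", square_code("d2"), square_code("d4"));  /* prints 4244 */
    return 0;
}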

Vincent

>   rgb
>
>>   Seymour Cray
>> 
>> Vincent
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>> 
>
>-- 
>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>
>
>
>


