[Beowulf] Cell


Vincent Diepeveen diep at xs4all.nl
Wed Apr 27 15:02:14 PDT 2005



At 05:41 PM 4/27/2005 -0400, Robert G. Brown wrote:
>> If you can deliver 1 processor that can do 1 tflop, there is no need for
>> bandwidth anymore, everything happens on that chip in such a case :)
>
>???
>
>Perhaps I'm highly confused, but I thought vector units operated on
>DATA, and that that data had to live in MEMORY.
>
>Memory, in turn, has some fairly stringent limits on the rate at which
>it can be accessed -- very definitely finite in terms of both latency
>and bandwidth.  Furthermore, for streaming vector-like operations in at
>least NORMAL PC-like architectures (as opposed to supercomputer
>architectures where a Lot Of Money (LOM) is spent feeding the processor),
>I would expect memory bandwidth to be the fundamental bottleneck in
>streaming/vector operations.
>
>To be more explicit -- if you run stream on most reasonably modern
>systems, stream "copy" takes a time that is very comparable to stream
>"scale".  For example, on a P4 I have handy:
>
>rgb at lucifer|B:1100>./benchmaster -t 2 -s 1000000 -i 1 -n 10
>avg full = 1.497222e+07 min = 1.492613e+07  max = 1.513106e+07
>Content-Length: 1067
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105536</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.158</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream copy</name>
>    <command>./benchmaster</command>
>    <args>-t 2 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = a[i] (standard is -s 1000000 -i 1 -n
>10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.50e+01</time>
>    <time_stddev units="nsec">1.80e-02</time_stddev>
>    <min_time units="nsec">1.49e+01</min_time>
>    <max_time units="nsec">1.51e+01</max_time>
>    <rate units="10e+6">1.07e+03</rate>
>  </benchmark>
></benchml>
>
>
>rgb at lucifer|B:1104>./benchmaster -t 3 -s 1000000 -i 1 -n 10
>avg full = 1.489518e+07 min = 1.486376e+07  max = 1.494210e+07
>Content-Length: 1075
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105536</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.118</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream scale</name>
>    <command>./benchmaster</command>
>    <args>-t 3 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = xtest*d[i] (standard is -s 1000000 -i 1 -n
>10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.49e+01</time>
>    <time_stddev units="nsec">8.47e-03</time_stddev>
>    <min_time units="nsec">1.49e+01</min_time>
>    <max_time units="nsec">1.49e+01</max_time>
>    <rate units="10e+6">1.07e+03</rate>
>  </benchmark>
></benchml>
>
>Note that these times are already identical.  The "scale" operation is
>done in parallel with I/O and takes "zero time".  Clearly the
>memory-to-CPU I/O is the bottleneck, not the processor, which is
>"infinitely fast" as far as this operation is concerned.
>
>Comparing triad:
>
>
>
>rgb at lucifer|B:1099>./benchmaster -t 5 -s 1000000 -i 1 -n 10
>avg full = 1.982203e+07 min = 1.980236e+07  max = 1.985822e+07
>Content-Length: 1081
>
><?xml version="1.0"?>
><benchml>
>  <version>Benchmaster 1.1.2</version>
>  <hostinfo>
>    <hostname>lucifer</hostname>
>    <vendor_id> GenuineIntel</vendor_id>
>    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
>    <CPU clock units="Mhz"> 1804.520</CPU clock>
>    <l2cache units="KB"> 512 KB</l2cache>
>    <memtotal units="KB">515840</memtotal>
>    <memfree units="KB">105144</memfree>
>    <nanotimer>cpu cycle counter nanotimer</nanotimer>
>    <nanotimer_granularity units="nsec">98.244</nanotimer_granularity>
>  </hostinfo>
>  <benchmark>
>    <name>stream triad</name>
>    <command>./benchmaster</command>
>    <args>-t 5 -s 1000000 -i 1 -n 10</args>
>    <description>d[i] = a[i] + xtest*b[i] (standard is -s 1000000 -i 1
>-n 10)</description>
>    <iterations>1</iterations>
>    <size>1000000</size>
>    <stride>1</stride>
>    <time units="nsec">1.98e+01</time>
>    <time_stddev units="nsec">6.08e-03</time_stddev>
>    <min_time units="nsec">1.98e+01</min_time>
>    <max_time units="nsec">1.99e+01</max_time>
>    <rate units="10e+6">1.21e+03</rate>
>  </benchmark>
></benchml>
>
>shows a slowdown of roughly 1/3 relative to copy (about 20 ns per loop
>pass instead of 15 ns per loop pass). Again, on-chip parallelism and the
>efficiency of the memory interface are clearly exhibited.
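
As a rough cross-check of the three runs above, the reported rates can be recomputed from the per-pass times. This is only a sketch: it assumes 8-byte doubles and that the rate field is in MB/s, neither of which the benchmaster output states, and the function name is made up for illustration.

```python
# Assumption: each array element is an 8-byte double.
BYTES_PER_DOUBLE = 8

# (name, time per loop pass in ns from the <time> field, arrays touched per pass)
RUNS = [
    ("stream copy", 15.0, 2),   # read a[i], write d[i]
    ("stream scale", 14.9, 2),  # read d[i], write d[i]
    ("stream triad", 19.8, 3),  # read a[i], read b[i], write d[i]
]


def bandwidth_mb_per_s(time_ns, arrays_touched):
    """Bytes moved per loop pass divided by the time per pass, in MB/s."""
    bytes_per_pass = arrays_touched * BYTES_PER_DOUBLE
    return bytes_per_pass / (time_ns * 1e-9) / 1e6


for name, time_ns, arrays in RUNS:
    print(f"{name}: {bandwidth_mb_per_s(time_ns, arrays):.0f} MB/s")
```

Under those assumptions the results agree with the 1.07e+03 and 1.21e+03 figures in the rate fields to three digits, which suggests the rate is simply bytes moved per unit time.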
>
>In other words, if you stuck a "1 TFLOP" vector processor into this
>particular MEMORY architecture and ran stream, I wouldn't expect to see
>much speedup.  The bulk of the time required for the actual numerical
>operations is already hidden by the memory bottleneck and design
>parallelism.  It isn't clear how much a larger cache would help, either,
>on streaming code that doesn't reuse any memory location -- there comes
>a point of diminishing returns where you're stuck at fundamental
>bandwidth again, waiting for the cache to refill, when the CPU is much
>faster.  I tend to think of large caches as being useful to people with
>lots of contexts to juggle more than a necessary advantage to people
>running LARGE vector operations, although they are an obvious advantage
>to folks whose vectors can fit into cache:-)
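
The argument above can be put as a simple roofline-style bound: throughput is limited by the smaller of the peak compute rate and what the memory system can feed. The sketch below is not from the original post. It assumes 8-byte doubles, reads the triad rate as MB/s, and uses hypothetical names throughout.

```python
def attainable_mflops(peak_mflops, bandwidth_mb_per_s, flops_per_pass, bytes_per_pass):
    """Bound throughput by the slower of compute and memory traffic."""
    memory_bound = bandwidth_mb_per_s * flops_per_pass / bytes_per_pass
    return min(peak_mflops, memory_bound)


TRIAD_FLOPS = 2          # one multiply and one add per element
TRIAD_BYTES = 24         # three 8-byte doubles per element (assumed size)
MEASURED_MB_S = 1.21e3   # triad rate from the benchmaster run, read as MB/s

# Compare a 1 GFLOP part with a 1 TFLOP part on the same memory system.
for label, peak_mflops in [("1 GFLOP", 1e3), ("1 TFLOP", 1e6)]:
    bound = attainable_mflops(peak_mflops, MEASURED_MB_S, TRIAD_FLOPS, TRIAD_BYTES)
    print(f"{label} peak: about {bound:.0f} MFLOPS attainable on triad")
```

Both peaks come out at the same figure, because on this memory system the triad loop is bound by bandwidth and not by arithmetic.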
>
>> "If you were plowing a field, which would you rather use? Two strong oxen
>> or 1024 chickens?"
>
>I'd take the chickens any day, if each chicken could plow 1/1024th of
>the field faster than the two oxen could do the whole thing.
>
>And they only cost chickenfeed...;-)

The fundamental problem is you need a $2 million network for your 1024
chickens.

All I need with the virtual 1 TFLOP processor is some of that XDR memory:
add a few memory banks to the mainboard, each independently connected
to one of the 4 sides of the chip.

Even if that XDR Rambus memory is grossly expensive at, say, $1000 for
2 gigabytes, I put 4 DIMMs of it at each independent memory controller
of the chip, delivering a total bandwidth of 100 gigabytes a second.

Even an average engineer can add on-die memory controllers to a chip.

The extremely expensive mainboard I buy for $2500, and I feel ripped off.
The extremely expensive Cell processor I buy for $3000, and I feel ripped off.

But even then, at a total price of say $10k, the system is eating your $3
million 1024-chicken system alive.

See my point?

Whatever price the 1024 chickens have, your network costs several million.

A 12288-chicken system from IBM costs, to be precise, 8 million euro and
delivers 27.5 TFLOP. Thanks to the network it is not a chicken system but
Europe's fastest supercomputer.
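
The price comparison can be written out as a back-of-the-envelope calculation. The figures are the ones stated in this post; the 1 TFLOP machine is hypothetical, the dictionary and function names are invented for illustration, and currencies are deliberately left unconverted. The parts total assumes 4 DIMMs in all (one per memory controller), which is one reading of the text.

```python
# Figures as stated in the post; no currency conversion is attempted.
CELL_BOX = {"price": 10_000, "tflop": 1.0, "currency": "USD"}      # hypothetical machine
IBM_SYSTEM = {"price": 8_000_000, "tflop": 27.5, "currency": "EUR"}


def price_per_tflop(system):
    """Purchase price divided by delivered TFLOP, in the system's own currency."""
    return system["price"] / system["tflop"]


def parts_total(n_dimms=4, dimm=1_000, mainboard=2_500, cpu=3_000):
    """Sum of the itemised prices; n_dimms=4 is an assumption, not a stated figure."""
    return n_dimms * dimm + mainboard + cpu


for name, system in [("1 TFLOP box", CELL_BOX), ("IBM system", IBM_SYSTEM)]:
    print(f"{name}: {price_per_tflop(system):,.0f} {system['currency']} per TFLOP")
print(f"itemised parts: ${parts_total():,}")
```

With 4 DIMMs the parts come to $9,500, close to the round $10k figure. Reading the text as 4 DIMMs per controller (16 in total) would give $21,500 instead, which still leaves a wide gap per TFLOP.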

By the way, my chess program doesn't output 2 bytes; it outputs 4 bytes
every few minutes.

A chess move.

If you have a processor that can run it, just outputting 4 bytes every few
minutes, at a speed equal to a 1 THz Opteron processor, then I can sell
100000 of those machines hands down if the price is that of a normal PC.

On a 1 THz machine it will output d2d4 as the first move anyway. In
correspondence chess notation 'd2' is square 42 and 'd4' is 44.

So its output is 42 anyway.
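
For readers unfamiliar with the numeric notation used in correspondence chess, files a to h map to the digits 1 to 8 and the rank digit is kept as is. A minimal sketch of that mapping follows; the function name is made up for illustration, and promotions and castling are not handled.

```python
def to_numeric(move):
    """Convert a coordinate move such as 'd2d4' to numeric notation."""
    if len(move) != 4:
        raise ValueError("expected a move like 'd2d4'")
    digits = []
    for file_char, rank_char in (move[0:2], move[2:4]):
        if not ("a" <= file_char <= "h" and "1" <= rank_char <= "8"):
            raise ValueError(f"bad square: {file_char}{rank_char}")
        # File letter becomes a digit 1-8; the rank digit is unchanged.
        digits.append(str(ord(file_char) - ord("a") + 1) + rank_char)
    return "".join(digits)


print(to_numeric("d2d4"))  # prints 4244
```

The four output bytes for the move d2d4 are therefore 4244, and the first two of them are the 42 mentioned above.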

Vincent

>   rgb
>
>>   Seymour Cray
>> 
>> Vincent
>> 
>
>-- 
>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>
>
>
>


