[Beowulf] Cell
Robert G. Brown rgb at phy.duke.edu
Wed Apr 27 14:41:56 PDT 2005

> If you can deliver 1 processor that can do 1 tflop, there is no need for
> bandwidth anymore, everything happens on that chip in such a case :)

???

Perhaps I'm highly confused, but I thought vector units operated on
DATA, and that that data had to live in MEMORY. Memory, in turn, has
some fairly stringent limits on the rate at which it can be accessed --
very definitely finite in terms of both latency and bandwidth.

Furthermore, for streaming vector-like operations in at least NORMAL
PC-like architectures (as opposed to supercomputer architectures where a
Lot Of Money (LOM) is spent feeding the processor) I would expect memory
bandwidth to be the fundamental bottleneck in streaming/vector
operations.

To be more explicit -- if you run stream on most reasonably modern
systems, stream "copy" takes a time that is very comparable to stream
"scale". For example, on a P4 I have handy:

rgb@lucifer|B:1100>./benchmaster -t 2 -s 1000000 -i 1 -n 10
avg full = 1.497222e+07 min = 1.492613e+07 max = 1.513106e+07
Content-Length: 1067

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105536</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.158</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream copy</name>
    <command>./benchmaster</command>
    <args>-t 2 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = a[i] (standard is -s 1000000 -i 1 -n 10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.50e+01</time>
    <time_stddev units="nsec">1.80e-02</time_stddev>
    <min_time units="nsec">1.49e+01</min_time>
    <max_time units="nsec">1.51e+01</max_time>
    <rate units="10e+6">1.07e+03</rate>
  </benchmark>
</benchml>

rgb@lucifer|B:1104>./benchmaster -t 3 -s 1000000 -i 1 -n 10
avg full = 1.489518e+07 min = 1.486376e+07 max = 1.494210e+07
Content-Length: 1075

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105536</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.118</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream scale</name>
    <command>./benchmaster</command>
    <args>-t 3 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = xtest*d[i] (standard is -s 1000000 -i 1 -n 10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.49e+01</time>
    <time_stddev units="nsec">8.47e-03</time_stddev>
    <min_time units="nsec">1.49e+01</min_time>
    <max_time units="nsec">1.49e+01</max_time>
    <rate units="10e+6">1.07e+03</rate>
  </benchmark>
</benchml>

Note that these times are already essentially identical. The "scale"
operation is done in parallel with I/O and takes "zero time". Clearly
the memory-to-CPU I/O is the bottleneck, not the processor, which is
"infinitely fast" as far as this operation is concerned.

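The per-element times above can be turned into an effective memory bandwidth with a little arithmetic. The sketch below is only illustrative and rests on two assumptions that the output itself does not state: that each array element is an 8-byte double, and that copy and scale each move one read plus one write per element. The function name `bandwidth_mb_per_s` is made up for this example.

```python
# Effective memory bandwidth implied by the reported per-element times.
# Assumption: elements are 8-byte doubles, one read + one write per pass.
BYTES_PER_DOUBLE = 8


def bandwidth_mb_per_s(bytes_per_element, ns_per_element):
    """Return bandwidth in MB/s (10^6 bytes per second) for one loop pass."""
    bytes_per_second = bytes_per_element / (ns_per_element * 1e-9)
    return bytes_per_second / 1e6


# Per-element times taken from the <time> fields of the two runs.
copy_ns = 15.0   # stream copy:  d[i] = a[i]
scale_ns = 14.9  # stream scale: d[i] = xtest*d[i]

copy_bw = bandwidth_mb_per_s(2 * BYTES_PER_DOUBLE, copy_ns)
scale_bw = bandwidth_mb_per_s(2 * BYTES_PER_DOUBLE, scale_ns)

print(f"copy : {copy_bw:7.1f} MB/s")
print(f"scale: {scale_bw:7.1f} MB/s")
```

Under those assumptions both kernels come out at about 1.07e+03 MB/s, which agrees with the reported rate fields. The extra multiply in "scale" does not show up in the bandwidth at all.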
Comparing triad:

rgb@lucifer|B:1099>./benchmaster -t 5 -s 1000000 -i 1 -n 10
avg full = 1.982203e+07 min = 1.980236e+07 max = 1.985822e+07
Content-Length: 1081

<?xml version="1.0"?>
<benchml>
  <version>Benchmaster 1.1.2</version>
  <hostinfo>
    <hostname>lucifer</hostname>
    <vendor_id> GenuineIntel</vendor_id>
    <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name>
    <CPU clock units="Mhz"> 1804.520</CPU clock>
    <l2cache units="KB"> 512 KB</l2cache>
    <memtotal units="KB">515840</memtotal>
    <memfree units="KB">105144</memfree>
    <nanotimer>cpu cycle counter nanotimer</nanotimer>
    <nanotimer_granularity units="nsec">98.244</nanotimer_granularity>
  </hostinfo>
  <benchmark>
    <name>stream triad</name>
    <command>./benchmaster</command>
    <args>-t 5 -s 1000000 -i 1 -n 10</args>
    <description>d[i] = a[i] + xtest*b[i] (standard is -s 1000000 -i 1 -n 10)</description>
    <iterations>1</iterations>
    <size>1000000</size>
    <stride>1</stride>
    <time units="nsec">1.98e+01</time>
    <time_stddev units="nsec">6.08e-03</time_stddev>
    <min_time units="nsec">1.98e+01</min_time>
    <max_time units="nsec">1.99e+01</max_time>
    <rate units="10e+6">1.21e+03</rate>
  </benchmark>
</benchml>

shows a slowdown of roughly 1/3 relative to copy (about 20 ns per loop
pass instead of 15 ns per loop pass). Again, on-chip parallelism and the
efficiency of the memory interface are clearly exhibited.

In other words, if you stuck a "1 TFLOP" vector processor into this
particular MEMORY architecture and ran stream, I wouldn't expect to see
much speedup. The bulk of the time required for the actual numerical
operations is already hidden by the memory bottleneck and design
parallelism.

It isn't clear how much a larger cache would help, either, on streaming
code that doesn't reuse any memory location -- there comes a point of
diminishing returns where you're stuck at fundamental bandwidth again,
waiting for the cache to refill, when the CPU is much faster.

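The "1 TFLOP processor in the same memory system" argument can be sketched as a crude overlap model: if arithmetic is fully hidden behind memory traffic, the time per element is whichever of the two is slower. This is a simplification, not a description of how benchmaster works. It assumes 8-byte doubles (so triad moves 24 bytes per element), and the 1 GFLOP/s baseline is a hypothetical round number, not a measured figure for this P4.

```python
# Crude overlap model: time per element = max(memory time, compute time).
# Assumptions: 8-byte doubles; compute overlaps perfectly with memory I/O.


def time_per_element_ns(bytes_per_element, flops_per_element,
                        bandwidth_bytes_per_s, peak_flops_per_s):
    """Return the modeled time per loop pass in nanoseconds."""
    memory_ns = bytes_per_element / bandwidth_bytes_per_s * 1e9
    compute_ns = flops_per_element / peak_flops_per_s * 1e9
    return max(memory_ns, compute_ns)


# Triad: d[i] = a[i] + xtest*b[i] -> two reads, one write, two flops.
TRIAD_BYTES = 24
TRIAD_FLOPS = 2

# Bandwidth implied by the measured 19.8 ns per triad pass (~1.21e9 B/s).
measured_bw = TRIAD_BYTES / 19.8e-9

results = {}
for label, peak in [("1 GFLOP/s (hypothetical)", 1e9), ("1 TFLOP/s", 1e12)]:
    results[label] = time_per_element_ns(
        TRIAD_BYTES, TRIAD_FLOPS, measured_bw, peak)
    print(f"{label:26s} {results[label]:.1f} ns per element")
```

In this model both processors take the same 19.8 ns per element. A thousandfold faster arithmetic unit changes nothing until the memory side gets faster too.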
I tend to think of large caches as being more useful to people with lots
of contexts to juggle than a necessary advantage to people running LARGE
vector operations, although they are an obvious advantage to folks whose
vectors can fit into cache:-)

> "If you were plowing a field, which would you rather use? Two strong oxen
> or 1024 chickens?"

I'd take the chickens any day, if each chicken could plow 1/1024th of
the field faster than the two oxen could do the whole thing.

And they only cost chickenfeed...;-)

   rgb

> Seymour Cray
>
> Vincent
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
