[Beowulf] Cell
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduThu Apr 28 07:27:10 PDT 2005
- Previous message: [Beowulf] Cell
- Next message: [Beowulf] Cell
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 28 Apr 2005, Vincent Diepeveen wrote: > >> "If you were plowing a field, which would you rather use? Two strong oxen > >> or 1024 chickens?" > > > >I'd take the chickens any day, if each chicken could plow 1/1024th of > >the field faster than the two oxen could do the whole thing. > > > >And they only cost chickenfeed...;-) > > The fundamental problem is you need a $2 million network for your 1024 > chickens. > > All i need with the virtual 1 tflop processor is some of that XDR memory, > add a few memory banks at the mainboard which are each independantly > connected to one of the 4 sides of the chip. Ah, Vincent, I begin to see your point. We are (not surprisingly, I suppose) talking about two different things. I thought that your remarks were directed at the discussion on VECTOR PROCESSORS. Vector processors are speciality processors designed to do one thing only, but do it very well. Take in one or more vectors of data on the right, perform any one of several relatively simple/linear transformations on each element (generally the same transformation or set of transformations to every element) and spit out a vector of results on the left. Classic examples of vector processors are e.g. digital signal processors or graphics processing units -- both take a fairly predictable stream of vectorized data and wreak a fairly standard set of transformations upon it. See: ttp://www.epcc.ed.ac.uk/HPCinfo/glossary.html Because they don't have to waste all that processors space on the implementation of general purpose instructions, they can stack up nice little floating point pipelines much deeper than a general purpose CPU, and run many M successive steps of a vector transformation at once, where it is applying instruction 1 to the Mth data object at the same time it is applying instruction M to the 1st data object in a stream of N objects. If N is much much bigger than M, and if the processor doesn't have to do annoying things like compare each result to something and maybe branch and flush and refill the entire pipeline, this can reduce the time required per instruction to something like 1/Mth of the clock (presuming that the logic can retire one instruction per clock cycle at each level). Note that general purpose CPUs often have pipelines these days and can manage "vector instructions" but the pipelines are not very deep -- typically what, 2-3 instructions? -- and often support only a few speciality floating point operations. This is enough to enable relatively fast simple transforms that involve e.g. multiply-adds (a very common instruction pair in linear algebra and one of the things studied by stream). Thus a perrennial question discussed on the list is whether or not beowulfs equipped with attached DSP or GPUs to do vector instructions is a "win" in terms of performance for people with the right code. This often appears in the context of a discussion of whether or not to build a cluster out of the latest round of game boxes (e.g. gamecubes, playstations, whatever). This is because these boxes, these days, have enough general purpose power to be able to run a simple operating system including a network, but have dedicated graphics processors so that they can do their game graphics transformations very quickly. There have been "clusters" with vector processors attached to scalar processors sold as (usually SIMD) supercomputers in years past, e.g. Thinking Machines CM series -- consensus among my friends who used them are that they were a godawful pain to program but sure, quite speedy for the right code. I think that the accuracy of this description persists to this very day for this sort of architecture. Now, I was pointing out the following simple fact. Memory speeds are MUCH MUCH SLOWER than CPU speeds, and for that matter, pretty much always have been. Furthermore, faster memory (for any given memory generation) costs MUCH more than slower memory. That is WHY there is a memory hierarchy on EVERY system ever built. It runs something like the following simplified picture with times that are best of my recollection and may not be current (they very definitely change as systems evolve, although more slowly than Moore's Law advances raw CPU speed, creating a widening gap): register memory: on CPU -- fastest, access in ~1 clock as part of execution cycle L1 cache: static RAM, on CPU -- very fast, access in a few clocks (<=1 ns?) L2 cache: static RAM, also "on" most modern CPUs but hierarchically lives between L1 and either L3 on the few systems that have a third cache layer or more commonly DRAM -- main memory. access in ~1-5 ns. DRAM: main dynamic memory on most computers. A typical modern design pays a relatively high latency hit for reading at a new address (say 40 ns) followed by much lower hits for streaming sequential reads beginning at the first location -- IIRC around 10 ns each, but this changes fast enough that I wouldn't be surprised if my numbers are obsolete. How much data gets clocked in per cycle depends on many things such as the width of the datapath. Virtual memory: swap, NUMA, non-NUMA network partitioned memory. Typically BOTH high latency AND relatively low bandwidth, although NUMA SMP designs can be RELATIVELY fast since that's the whole point. Numbers can range from about an order of magnitude slower than DRAM to many orders of magnitude slower, especially for latency-bounded access to disk-based swap where disk latencies are typically in the MILLIsecond range. If you are running code that "needs" VM to fit, you're either dead (it can be viewed as a code design "bug", as I think Mark pointed out) or you have to work very hard organizing your code to get "acceptable" speeds Note that with the exception of register memory all memory hits on a system SLOW the system relative to its theoretical instruction execution rate. Instructions are typically retired in 1/2-2 clock cycles (depending on pipelining and parallelism) which is a sub-ns time on modern CPUs. Accessing memory outside of L1 takes many clocks, although lots of things work to make those accesses run in parallel with execution so that the CPU has to wait a minimal amount of time to get data/code into its working memory. The whole point of having the memory hierarchy is that if your program, or data, have any significant degree of locality, having the cache hides the slowness of the DRAM subsystem, which is a factor of 30-100 slower than operational register memory on the CPU itself. If your code is 90% local at any given layer of cache, the average time required to run your code is something like 0.9 T_s + 0.1 T_l where T_s is the short time to get something out of (say) L_2 vs the long time T_l to get something out of DRAM. In ns, this could be something like 0.9 1 + 0.1 10 = 2 ns average access time instead of 10 ns average access time to get the same data straight out of DRAM without caching. The effect is especially great when looking at random access patterns, where one pays the 40 ns latency hit in DRAM. benchmaster has a "shuffled" mode that lets you look at random read, random write, random read/write rates to big blocks of memory, and memory performance really goes straight to hell compared to streaming for cache-defeating patterns. I do mean hell: rgb at lucifer|B:1115>./benchmaster -t 6 -s 1000000 -r avg full = 1.508151e+08 min = 1.502259e+08 max = 1.519108e+08 Content-Length: 1065 <?xml version="1.0"?> <benchml> <version>Benchmaster 1.1.2</version> <hostinfo> <hostname>lucifer</hostname> <vendor_id> GenuineIntel</vendor_id> <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name> <CPU clock units="Mhz"> 1804.520</CPU clock> <l2cache units="KB"> 512 KB</l2cache> <memtotal units="KB">515840</memtotal> <memfree units="KB">104832</memfree> <nanotimer>cpu cycle counter nanotimer</nanotimer> <nanotimer_granularity units="nsec">99.144</nanotimer_granularity> </hostinfo> <benchmark> <name>memory read/write</name> <command>./benchmaster</command> <args>-t 6 -s 1000000 -r</args> <description>Reads, then writes 4 byte integer vector</description> <iterations>2</iterations> <size>1000000</size> <stride>shuffled</stride> <time units="nsec">1.48e+02</time> <time_stddev units="nsec">3.97e-02</time_stddev> <min_time units="nsec">1.48e+02</min_time> <max_time units="nsec">1.49e+02</max_time> <rate units="10e+6">6.74e+00</rate> </benchmark> </benchml> Note the >>148 nsec<< access time required to read, then write, a single integer into/out of a shuffled/random memory location in a vector much larger than cache. This is much LARGER than you'd naively expect just from the properties of DDR on a 500 FSB system. Compare it to doing the SAME TASK (complete to the indirected addressing, but with an unshuffled addressing vector): rgb at lucifer|B:1116>./benchmaster -t 6 -s 1000000 avg full = 8.291473e+06 min = 8.127370e+06 max = 8.964347e+06 Content-Length: 1055 <?xml version="1.0"?> <benchml> <version>Benchmaster 1.1.2</version> <hostinfo> <hostname>lucifer</hostname> <vendor_id> GenuineIntel</vendor_id> <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name> <CPU clock units="Mhz"> 1804.520</CPU clock> <l2cache units="KB"> 512 KB</l2cache> <memtotal units="KB">515840</memtotal> <memfree units="KB">104832</memfree> <nanotimer>cpu cycle counter nanotimer</nanotimer> <nanotimer_granularity units="nsec">99.920</nanotimer_granularity> </hostinfo> <benchmark> <name>memory read/write</name> <command>./benchmaster</command> <args>-t 6 -s 1000000</args> <description>Reads, then writes 4 byte integer vector</description> <iterations>2</iterations> <size>1000000</size> <stride>1</stride> <time units="nsec">5.81e+00</time> <time_stddev units="nsec">2.66e-02</time_stddev> <min_time units="nsec">5.65e+00</min_time> <max_time units="nsec">6.48e+00</max_time> <rate units="10e+6">1.72e+02</rate> </benchmark> </benchml> Down to 5 nsec per integer -- a more efficient addressing loop would probably drop this a hair. The memory is no faster, but now cache "works" for a factor of 30 speedup. Go down still further in size: <?xml version="1.0"?> <benchml> <version>Benchmaster 1.1.2</version> <hostinfo> <hostname>lucifer</hostname> <vendor_id> GenuineIntel</vendor_id> <CPU name> Intel(R) Pentium(R) 4 CPU 1.80GHz</CPU name> <CPU clock units="Mhz"> 1804.520</CPU clock> <l2cache units="KB"> 512 KB</l2cache> <memtotal units="KB">515840</memtotal> <memfree units="KB">104832</memfree> <nanotimer>cpu cycle counter nanotimer</nanotimer> <nanotimer_granularity units="nsec">99.900</nanotimer_granularity> </hostinfo> <benchmark> <name>memory read/write</name> <command>./benchmaster</command> <args>-t 6 -s 1000</args> <description>Reads, then writes 4 byte integer vector</description> <iterations>256</iterations> <size>1000</size> <stride>1</stride> <time units="nsec">3.47e+00</time> <time_stddev units="nsec">7.70e-03</time_stddev> <min_time units="nsec">3.41e+00</min_time> <max_time units="nsec">3.62e+00</max_time> <rate units="10e+6">2.88e+02</rate> </benchmark> </benchml> into RW into L1, and you see times approaching 50x faster, where again some of this is double-indexing overhead. Even here random access forces one to give back a factor of 10 in speed -- it is faster to access a streaming int vector out of DRAM than it is to access data out of L1 in random order. The CPU really "wants" data access to be sequential and local, and makes you pay for it in time and idle cycles when it is not. The net result of all of this is that a LOT of what you perceive of as "raw CPU speed" on any given computer is an illusion -- smoke and mirrors -- perpetrated by a very clever memory hierarchy and will vanish instantly for the wrong kind of cache defeating work. There are all sorts of lovely human metaphors -- some work you can do "in your head", some requires information in papers on your desk that you have to put into your head, some involves papers in filing cabinets in your office (and indeed all your tasks START by getting out some particular file or set of files), some of those files have to get shipped in from the main office across town and all your results ultimately have to be shipped back to the main office when you're done. Same thing. You can be a super-genius and do the "brain work" for each file in zero time and only be a little faster than the plodding Joe next door if it takes both of you longer to get the files from across town into your local file cabinets, get the files from there onto your desks, and get the data from the files and into your respective heads than it does for either one of you to solve the actual problem once it is there. This is not always the case, of course, for HPC and computer tasks in general any more than it is for humans. Humans might WELL take on a task and require days or weeks to solve it from their "desktop" compared to minutes to hours to obtain the task and file away results. Plenty of HPC tasks involve things like computing trancendental functions or doing division inside the primary loop, tasks that take a LONG time and give the memory subsystems plenty of time to keep up with the CPU. So one invents a term to differentiate task types: "memory bound" tasks are ones where memory I/O is the rate-determining bottleneck that limits performance. "CPU bound" tasks are ones whose completion time scales with the inverse of CPU clock, more or less independent of the particular memory hierarchy involved. On parallel architectures, "interprocessor communications (IPC) bound" is an extension of the memory bound idea to distributed tasks; the IPC bottleneck has profound effects on the parallel SCALING of tasks across a set of nodes. >From what you are saying (in this and previous communications) your particular application, computer chess, is in the relatively happy regime where the application is CPU bound, hence your somewhat cavalier attitude to the VERY DIFFICULT AND EXPENSIVE TASK of designing the CPU-memory interface on modern computers. I don't think you appreciate how serious a problem this is, or how much serious money e.g. Intel or AMD spend on solving it. Whether or not you are memory bound can be determined easily enough from the task scaling -- if it takes (say) 4 seconds to determine its next move from some given configuration on a 500 MHz P6-family CPU, 2 seconds on a 1 GHz P6-family CPU, and 1 second on a 2 GHz P6-family CPU (independent of the amount or kind of memory, and probably even independent of the amount of L2 cache) then yes, it is probably CPU bound. Or use minutes or hours if appropriate. Guessing that a chess program spends most of its time doing character or integer tasks and comparisons -- processing large integer-indexed outcome trees of integer-sized binary encoded information with boolean instructions and indexing arithmetic -- one would expect the ENTIRE discussion on vector processors to be pretty much irrelevant to your application, as the VPs in question only are useful for doing streaming floating point arithmetic, and generally only for a fairly specific subset of numerical instructions and fairly specific data sizes at that. It isn't clear that they would be at ALL useful, or even usable, for/by your program. If, OTOH, the program perhaps STARTS by scaling with clock for relatively simple board configurations but then grinds to a metaphorical halt and no longer scales with CPU clock for more complex ones; it is quite possible that it starts CPU bound (for configurations that fit into L2 or are sufficiently local in memory that the architecture's cache fill algorithms "work") and then crosses over as the program has to look further and further away in memory (and in more and more random ways) to work out the outcome trees. In this case you'd observe a very definite cache effect -- larger processors wouldn't necessarily run instructions faster at any given clock but they WOULD run out of cache more often and avoid that nasty random access latency hit and hence would finish complex trees faster. You'd likely also observe a fairly straightforward dependence on memory type and speed, given that you're running on systems as old as 500 MHz PIII's, with their relatively slow FSB and relatively inefficient MMUs. Faster memory would produce immediate improvement independent of CPU speed, or on top of it as I don't think you can get 500 MHz systems with faster memory but rather have to get faster systems with faster memory. It wouldn't be surprising for those faster systems to relatively UNDERPERFORM CPU clock speed increases if your task IS partly memory bound, because of the aforementioned memory speed gap. Bleeding edge CPU clock speeds are about a factor of 8 faster than a 500 MHz PIII, but DRAM memory speeds have not gone up this much. > Even if that XDR rambuss grossely expensive at say $1000 for 2 gigabyte, > > then i put in 4 dimms of it at each independant memory controller of the > chip, > delivering a total bandwidth of 100 gigabyte a second. > > Even a medium engineer can add on die memory controllers to a chip. > > The extreme expensive mainboard i buy for $2500 and i feel ripped off. > The extreme expensive cell processor i buy for $3000 and i feel ripped off, > > But even then at a total price of say $10k the system is eating your $3 > million 1024 chicken system alive. > > See my point? > > Whatever price the 1024 chickens have, your network costs several million. No, no, no. First, the "oxen" you are proposing as an alternative simply doesn't exist -- we'd ALL love to have cheap 1 THz processors, and in lessee (log_2(25) = 4.64)x 2 = 9.3 years we might have them, if Moore's law holds out that long. If they did exist, we'd yoke them together and call them chickens and build QHz cluster out of them so we could finally count the number of angels that can dance on the head of a pin (the answer is doubtless 42, but we have to be sure:-). Second, many a 1024 chicken cluster is very productive indeed with a 100BT network that costs maybe $10/chic-- I mean node;-) Other clusters can get by with gigabit ethernet at perhaps $100/node. Then sure, there are high end clusters that spend as much or more on the network per node as they do on the nodes themselves, and intermediate RDMA and TCP/IP managing gigE NICs that come in somewhere in between. All of these options make SENSE and are COST-BENEFIT OPTIMAL for various classes of problems -- the 100 BT solution for all embarrassingly parallel problems and quite a few "coarse grained" parallel problems (e.g. rendering, computing nifty mandelbrot set pictures, maybe even chess), the successively faster network solutions for parallel problems with IPCs that are increasingly bandwith and/or latency bound. The idea is to pick a cluster architecture for which your problem is UNbound by IPCs out to some desireable degree of speedup, in the most affordable way, not to build the world's biggest cluster as a marketing ploy (however often the latter is what actually happens). It is worth pointing out NOW that it is well known that there are whole classes of problem that do NOT parallelize efficiently on any available hardware including even the most advanced networks. A cluster or supercomputer of ANY sort do not always make sense and there is a whole computer science of the scaling of the diminishing returns. Network I/O is still hovering around 1 microsecond latency and bandwidths in the range of billions of bits per second. The former is order(s) of magnitude slower than the slowest DRAM; the latter is order(s) of magnitude slower than the slowest DRAM. You cannot just refill registers on your CPU from a memory location on a remote system anywhere nearly as fast as you can refill it from your own DRAM, and in many cases you have to FIRST move memory from DRAM to DRAM and THEN let operations move it into L2, L1, registers and so on, so it is strictly an ADDITIONAL hit timewise. For single-threaded tasks performing serialized instructions, there is no advantage to using a cluster unless the task is so big that it won't fit into local memory, in which case a network-cluster-based VM extension may well be the fastest possible solution that lets you run at all (whether or not it is "fast enough" to do what you need to do). NOTE WELL! Speedup, slowdown, and what kind of network suffices to represent a "solution" to any given problem depends STRONGLY on task organization. You cannot just say "my task doesn't run any faster on a cluster". You may have to COMPLETELY REORGANIZE AND REWRITE your task so it WILL run faster on a cluster, and the reorganization and cluster design it will run on are INTEGRATED in the final solution so one of the former will by no means fit all of the latter. To make the discussion concrete, let's address two tasks. We just heard from somebody that discovered the simplest possible form of parallelism -- instead of trying to partition an entire computation and all of its linear algebra components to run efficiently on any kind of cluster (and most modern "real supercomputers" are just clusters with custom memory/cpu interconnects -- an internal "memory" if you like) he just ran multiple instances of the serialized job. Why not? He had lots of runs to make, and doing them at the same time on N nodes gets them done in 1/N of the time of a single node! This is the IDEAL! The more "like" this you can make your task, or any significant PART of your task, the better your parallel scaling will be and the greater the parallel speedup you will observe. Then lets think about YOUR task, parsing out chess moves. So far, it sounds like the code that does it is pretty much strictly serialized. Strictly parallelizing the serialized portions within any given tree may not be particularly efficient, as it may take longer to communicate state at each succeeding level than the task would take to run if it didn't pause to split off and communicate each tree node to a real node. What you'd like to do is find a task reorganization that makes the entire task partition, so "parts" of the computation can be advanced independently on completely different systems after being given relatively little starting information. In the case of chess, I'd guess that one trivial parallelization is to split up all the pieces that CAN move in any given configuration so that each node solves the tree of outcomes for just moving that one piece. Prune the results, and when any given piece dead-ends (and its node becomes free again) split off single-moove components of the productive trees recursively. Arrange it (if possible) so that communications are rare per node compared to node computation and you'll have a winner in terms of parallel speedup. This presumes that it is possible to make local decisions about the quality of positions that are sufficient to accept or reject them with relatively small IPCs (communicating the general range of "best position so far" across the cluster). I'm guessing that MOST possible initial moves at any given level lead to positions that are "obviously bad" and one only needs to pursue them far enough to overcome possible optimization humps that separate these "bad" positions from a really really good one accessible in only this way. Which tempts me to digress upon GA's, simulated annealing, and discrete or continuous optimization theory in general, but I will manfully resist...:-) Or something -- I haven't studied the problem and you have, but IF you can find a partitioning that REMAINS CPU bound or only weakly memory bound on a cluster and involves only relatively "small" IPCs in order to set up and manage the partitioning, then a cluster becomes a good way to get things to speed up a lot for only a little money. The more data you have to exchange in the partitioning, the more you have to spend (relatively speaking) on network compared to CPU, and the more "IPC bound" the parallel task scaling will be. > A 12288 chicken system from IBM costs to be precise 8 million euro, > delivering 27.5 tflop. Thanks to the network it is not a chicken system but > europese fastest supercomputer. Sure, sure, sure. But THIS list is as much about a 2.5 K$ 8-node cluster as it is about an 8 ME 12288-node cluster. "Europe's Fastest Supercomputer" with all those nodes and that really expensive network isn't worth a damn to somebody running a serialized program, and it may not be worth much to somebody running a fine grained synchronous parallel program whose parallel scaling peaks at 32 nodes. It will be MOST useful to people who run embarrassingly parallel programs, and the main point that I can see of having such an agressively large size is that it will let lots of people run much smaller programs at the same time, or let somebody with a 32-node task can run 384 instances of it at once, in an embarrassingly parallel way. And of course it will get a really high linpack score and hence end up near the top of the Top 500 list, which is good marketing and almost certainly the REAL point of the whole thing given the paucity of problems that might actually use the entire resource constructively all at once. > By the way, my chessprogram doesn't output 2 bytes, but it outputs 4 bytes > each few minutes. > > A chess move. > > If you have a processor that can run it, just outputting 4 bytes each few > minutes, at a speed equal to a 1 Thz opteron processor, then i can sell > 100000 machines of that hands down if its price is that of a normal PC. > > At a 1 thz machine it will output d2d4 anyway the first move. In > correspondence notation 'd2' is the field 42 and 'd4' is 44. > > So it's output is 42 anyway. Right, but what if you could partition the "next move" problem so that it ran on a $2500 8 processor cluster where each processor provided you with (say) 3 GHz of processing power today (instead of an entirely mythical 1 THz of single processor power). That gives you 24 GHz of processing power for about what a high end gaming system costs from Dell, and about 0.1 THz (100 GHz) for only $10K and 32 nodes. Even if you end up doubling the per-node cost and get better/faster nodes or a better network after you finish analyzing the problem decomposition, it is still pretty cheap FOR WHAT YOU WANT TO ACCOMPLISH which is a nontrivial computational task. My final advice is to think carefully about problem decomposition and recursion. AFAICT, chess in general has a natural decomposition and recursion structure, where although the space of possible moves grows superexponentially, it also prunes almost as fast so that slow old human brains reject most possible move sequences almost before they start to really work. If you can figure out any way at all to do "the same thing" on a cluster, to carry out the pruning based on local or almost-local information, you should be able to realize linear or near-linear speedup on very pedestrian cluster designs, ones with CHEAP networks and CHEAP nodes. It seems like a fun problem, even. rgb > > Vincent > > > rgb > > > >> Seymour Cray > >> > >> Vincent > >> _______________________________________________ > >> Beowulf mailing list, Beowulf at beowulf.org > >> To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > >> > > > >-- > >Robert G. Brown http://www.phy.duke.edu/~rgb/ > >Duke University Dept. of Physics, Box 90305 > >Durham, N.C. 27708-0305 > >Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > > > > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: [Beowulf] Cell
- Next message: [Beowulf] Cell
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
