Completely agree. I would highly recommend reading this blog post by James Hamilton (it's not HPC, but the principles still apply): http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx. The key quote:

"This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don't trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common."

So ECC is necessary, but not sufficient. Being able to test is critical.
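To make the "bit flips and corrupts" testing concrete, here is a rough sketch of the kind of check Hamilton is describing: a per-block checksum plus a test that deliberately flips one bit and confirms the corruption is caught. The block size, the choice of CRC-32 and all the names below are purely illustrative, not taken from any particular system.

/* Sketch: checksum every data block, and test that a single bit flip
 * is actually detected.  CRC-32 here is illustrative; any strong
 * per-block checksum serves the same purpose. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t crc32_block(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (-(crc & 1u)));
    }
    return ~crc;
}

int main(void)
{
    enum { BLOCK = 4096 };             /* illustrative block size      */
    uint8_t block[BLOCK];
    for (size_t i = 0; i < BLOCK; i++) /* fill with arbitrary data     */
        block[i] = (uint8_t)(i * 131u);

    uint32_t stored = crc32_block(block, BLOCK);  /* checksum at write time */

    /* Fault injection: flip one randomly chosen bit, as a corrupting
     * memory / disk / firmware layer might. */
    size_t byte = (size_t)rand() % BLOCK;
    block[byte] ^= (uint8_t)(1u << (rand() % 8));

    /* Verification at read time: a mismatch must be a first-class,
     * well-tested failure mode, not an afterthought. */
    if (crc32_block(block, BLOCK) != stored)
        puts("corruption detected -- take the simple, brute-force recovery path");
    else
        puts("corruption NOT detected -- the checksum is not doing its job");
    return 0;
}

The point is not the CRC itself but that the "corruption detected" branch is a path you exercise routinely, not dead code you hope works.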
Deepak
On Sunday, November 4, 2012 at 10:06 AM, Jörg Saßmannshausen wrote:
> Hi all,
>
> I agree with Vincent regarding ECC; I think it is really mandatory for a cluster which does number crunching.
>
> However, the best cluster does not help if the deployed code does not have a test suite to verify the installation. Believe me, that is not an exception: I know a number of chemistry codes which are used in practice and there is no test suite, or the test suite is broken and it actually says on the code's webpage: don't bother using the test suite, it is broken and we know it.
>
> So you need both: good hardware _and_ good software with a test suite to generate meaningful results. If one of the requirements is not met, we might as well throw dice, which is cheaper ;-)
>
> All the best from a wet London
>
> Jörg
>
> On Sunday 04 November 2012 Vincent Diepeveen wrote:
>
>> On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
>>
>>> On 11/3/12 6:55 PM, "Robin Whittle" <rw@firstpr.com.au> wrote:
>>>
>>>> <snip>
>>
>> [snip]
>>
>>>> For serious work, the cluster and its software needs to survive power outages, failure of individual servers and memory errors, so ECC memory is a good investment . . . which typically requires more expensive motherboards and CPUs.
>>>
>>> Actually, I don't know that I would agree with you about ECC, etc. ECC memory is an attempt to create "perfect memory". As you scale up, the assumption of "perfect computation" becomes less realistic, so that means your application (or the infrastructure on which the application sits) has to explicitly address failures, because at sufficiently large scale, they are inevitable. Once you've dealt with that, then whether ECC is needed or not (or better power supplies, or cooling fans, or lunar gravity phase compensation, or whatever) is part of your computational design and budget: it might be cheaper (using whatever metric) to overprovision and allow errors than to buy fewer, better widgets.
>>
>> I don't know whether 'outages' are a big issue for all clusters - here in Western Europe we hardly have power failures, so I can imagine a company with a cluster not investing in battery packs, as the company won't be able to run anyway if there isn't power.
>>
>> More interesting is the ECC discussion.
>>
>> ECC is simply a requirement IMHO, not a 'luxury thing' as some hardware engineers see it.
>>
>> I know some memory engineers disagree here - for example, one of them mentioned to me that "putting ECC onto a GPU is nonsense as it is a lot of effort and DDR5 already has a built-in CRC", something like that (if I remember the quote correctly).
>>
>> But they do not administer servers themselves.
>>
>> Also, they don't understand the accuracy, or rather LACK of accuracy, in the checking of calculations done by some who compute on big iron. If you calculate on a cluster and get a result after some months, the reality is simply that 99% of researchers aren't in the Einstein league, and 90% simply aren't careful enough, by any standard, to see an obvious problem generated by a bit flip here or there. They would just happily invent a new theory, as we have already seen too often in history.
>>
>> By simply putting ECC in there, you avoid this 'interpreting the results correctly' problem in some percentage of cases.
>>
>> Furthermore, there are too many calculations where a single bit flip could be catastrophic, and calculating for a few months on hundreds of cores without ECC is asking for trouble.
>>
>> As a last argument I want to note that in many sciences we simply see that the post-Second-World-War standard of using alpha = 0.05, or an error of at most 5% (2 x standard deviation), simply isn't accurate enough anymore for today's generation of scientists.
>>
>> They need more accuracy.
>>
>> So, historic debates about what is or isn't enough aside - reducing errors by means of ECC is really important.
>>
>> Now that said - if someone shows up with a different form of checking that's just as accurate or even better, that would be acceptable as well - yet most discussions with hardware engineers typically go like: "why do all this effort to get rid of a few errors, meanwhile if my Windows laptop crashes I just reboot it".
>>
>> Such discussions really should be discussions of the past - society is moving on - one needs far higher accuracy and reliability now, simply because the CPUs do more calculations and the memory therefore has to serve more bytes per second.
>>
>> In all, ECC is a requirement for huge clusters and, from my viewpoint, also for relatively tiny clusters.
>>
>>>> I understand that the most serious limitation of this approach is the bandwidth and latency (how long it takes for a message to get to the destination server) of 1Gbps Ethernet. The most obvious alternatives are using multiple 1Gbps Ethernet connections per server (but this is complex and only marginally improves bandwidth, while doing little or nothing for latency) or upgrading to Infiniband. As far as I know, Infiniband is exotic and expensive compared to the mass market motherboards etc. from which a Beowulf cluster can be made. In other words, I think Infiniband is required to make a cluster work really well, but it does not (yet) meet the original Beowulf goal of being inexpensive and commonly available.
>>>
>>> Perhaps a distinction should be made between "original Beowulf" and "cluster computer"? As you say, the original idea (espoused in the book, etc.) is a cluster built from cheap commodity parts. That would mean "commodity packaging", "commodity interconnects", etc., which for the most part meant tower cases and Ethernet. However, cheap custom sheet metal is now available (back when Beowulfs were first being built, rooms full of servers were still a fairly new and novel thing, and you paid a significant premium for rack mount chassis, especially as consumer pressure forced the traditional tower case prices down).
>>>> I think this model of HPC cluster computing remains fundamentally true, but there are two important developments in recent years which either alter the way a cluster would be built or used, or which may make the best solution to a computing problem no longer a cluster. These developments are large numbers of CPU cores per server, and the use of GPUs to do massive amounts of computing in a single inexpensive graphics card - more crunching than was possible in massive clusters a decade earlier.
>>>
>>> Yes. But in some ways, utilizing them has the same sort of software problem as using multiple nodes in the first place (EP aside). And the architecture of the interconnects is heterogeneous compared to the fairly uniform interconnect of a generalized cluster fabric. One can raise the same issues with cache, by the way.
>>>
>>>> The ideal computing system would have a single CPU core which could run at arbitrarily high frequencies, with low latency, high bandwidth, access to an arbitrarily large amount of RAM, with matching links to hard disks or other non-volatile storage systems, and with a good Ethernet link to the rest of the world.
>>>>
>>>> While CPU clock frequencies and computing effort per clock frequency have been growing slowly for the last 10 years or so, there has been a continuing increase in the number of CPU cores per CPU device (typically a single chip, but sometimes multiple chips in a device which is plugged into the motherboard) and in the number of CPU devices which can be plugged into a motherboard.
>>>
>>> That's because CPU clock is limited by physics. "Work per clock cycle" is also limited by physics to a certain extent (because today's processors are mostly synchronous, so you have a propagation delay time from one side of the processor to the other), except for things like array processors (SIMD) - but I'd say that's just multiple processors that happen to be doing the same thing, rather than a single processor doing more.
>>>
>>> The real force driving multiple cores is the incredible expense of getting on and off chip. Moving a bit across the chip is easy compared to off chip: you have to change the voltage levels, have enough current to drive a trace, propagate down that trace, receive the signal at the other end, and shift voltages again.
>>>> Most mass market motherboards are for a single CPU device, but there are a few two- and four-CPU motherboards for Intel and AMD CPUs.
>>>>
>>>> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores per CPU device. I think the 4-core i7 CPUs or their ECC-compatible Xeon equivalents are marginally faster than those with 6 or 8 cores.
>>>>
>>>> In all cases, as far as I know, combining multiple CPU cores and/or multiple CPU devices results in a single computer system, with a single operating system and a single body of memory, with multiple CPU cores all running around in this shared memory.
>>>
>>> Yes.. That's a fairly simple model and easy to program for.
>>>
>>>> I have no clear idea how each CPU core knows what the other cores have written to the RAM they are using, since each core is reading and writing via its own cache of the memory contents. This raises the question of inter-CPU-core communications, within a single CPU chip, between chips in a multi-chip CPU module, and between multiple CPU modules on the one motherboard.
>>>
>>> Generally handled by the OS kernel. In a multitasking OS, the scheduler just assigns the next free CPU to the next task. Whether you restore the context from processor A to processor A or to processor B doesn't make much difference. Obviously, there are cache issues (since that's part of context). This kind of thing is why multiprocessor kernels are non-trivial.
>>>
>>>> I understand that MPI works identically from the programmer's perspective between CPU cores on a shared memory computer and between CPU cores on separate servers. However, the performance (low latency and high bandwidth) of these communications within a single shared memory system is vastly higher than between any separate servers, which would rely on Infiniband or Ethernet.
>>>
>>> Yes. This is a problem with a simple interconnect model.. It doesn't necessarily reflect that the cost of the interconnect is different depending on how far and how fast you're going. That said, there is a fair amount of research into this. Hypercube processors had limited interconnects between nodes (only nearest neighbor) and there are toroidal fabrics (2D interconnects) as well.
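To illustrate Robin's point that MPI looks the same to the programmer whether the ranks share memory or sit on separate servers, here is a minimal sketch using only standard MPI calls (nothing application-specific): the identical source runs on one multi-core box or across a cluster, and only the transport underneath - shared memory versus Ethernet/Infiniband - changes.

/* Sketch: the same MPI source runs unchanged on one shared-memory node
 * or across a cluster; only the transport underneath changes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank contributes a partial result; MPI_Allreduce combines
     * them whether the ranks are cores in one box or separate servers. */
    double local = 1.0 / (rank + 1);
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}

With Open MPI, for example, "mpirun -np 8 ./reduce" on a single node and "mpirun -np 8 --hostfile nodes ./reduce" across machines launch the same binary; the observable difference is latency and bandwidth, not the code.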
>>>> So even if you have, or are going to write, MPI-based software which can run on a cluster, there may be an argument for not building a cluster as such, but for building a single-motherboard system with as many as 64 CPU cores.
>>>
>>> Sure.. If your problem is of a size that it can be solved by a single box, then that's usually the way to go. (It applies in areas outside of computing.. Better to have one big transmitter tube than lots of little ones.) But it doesn't scale. The instant the problem gets too big, then you're stuck. The advantage of clusters is that they are scalable. Your problem gets 2x bigger; in theory, you add another N nodes and you're ready to go (Amdahl's law can bite you, though).
>>>
>>> There's even been a lot of discussion over the years on this list about the optimum size cluster to build for a big task, given that computers are getting cheaper/more powerful. If you've got 2 years' worth of computing, do you buy a computer today that can finish the job in 2 years, or do you do nothing for a year and buy a computer that is twice as fast in a year?
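On "Amdahl's law can bite you", a quick worked number (the 95% parallel fraction below is purely illustrative):

S(N) = \frac{1}{(1 - p) + p/N}

With p = 0.95 and N = 64: S = \frac{1}{0.05 + 0.95/64} \approx 15.4, and doubling to N = 128 only lifts it to about 17.4. The serial 5% dominates long before you run out of budget for nodes.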
>>>> I think the major new big academic cluster projects focus on getting as many CPU cores as possible into a single server, while minimising power consumption per unit of compute power, and then hooking as many as possible of these servers together with Infiniband.
>>>
>>> That might be an aspect of trying to make a general purpose computing resource within a specified budget.
>>>
>>>> Here is a somewhat rambling discussion of my own thoughts regarding clusters and multi-core machines, for my own purposes. My interests in high performance computing involve music synthesis and physics simulation.
>>>>
>>>> There is an existing, single-threaded (written in C, can't be made multithreaded in any reasonable manner) music synthesis program called Csound. I want to use this now, but as a language for synthesis I think it is extremely clunky. So I plan to write my own program - one day . . . When I do, it will be written in C++ and multithreaded, so it will run nicely on multiple CPU cores in a single machine. Writing and debugging a multithreaded program is more complex than doing so for a single-threaded program, but I think it will be practical and a lot easier than writing and debugging an MPI-based program running either on multiple servers or on multiple CPU cores on a single server.
>>>
>>> Maybe, maybe not. How is your interthread communication architecture structured? Once you bite the bullet and go with a message passing model, it's a lot more scalable, because you're not doing stuff like "shared memory".
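Jim's "message passing between threads" suggestion can be made concrete with a small bounded queue. The sketch below is entirely made up (a producer thread handing blocks of samples to a consumer), not anyone's actual synthesis code, but it shows why the structure ports naturally to MPI later: every message already has exactly one owner.

/* Sketch: threads communicating through a small message queue instead of
 * touching shared state directly. */
#include <pthread.h>
#include <stdio.h>

#define QSIZE 64

typedef struct {
    double buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static queue_t q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER
};

static void q_push(double x)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == QSIZE)                    /* wait for room        */
        pthread_cond_wait(&q.not_full, &q.lock);
    q.buf[q.tail] = x;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

static double q_pop(void)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                        /* wait for a message   */
        pthread_cond_wait(&q.not_empty, &q.lock);
    double x = q.buf[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.lock);
    return x;
}

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        q_push(i * 0.001);    /* e.g. a block of synthesized samples */
    q_push(-1.0);             /* sentinel: no more work              */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    double x, sum = 0.0;
    while ((x = q_pop()) >= 0.0)   /* consumer drains messages */
        sum += x;

    pthread_join(t, NULL);
    printf("consumed, sum = %f\n", sum);
    return 0;
}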
>>>> I want to do some simulation of electromagnetic wave propagation using an existing and widely used MPI-based (C++, open source) program called Meep. This can run as a single thread, if there is enough RAM, or the problem can be split up to run over multiple threads using MPI communication between the threads. If this is done on a single server, then the MPI communication is done really quickly, via shared memory, which is vastly faster than using Ethernet or Infiniband to other servers. However, this places a limit on the number of CPU cores and the total memory. When simulating three-dimensional models, the RAM and CPU demands can easily become extreme. Meep was written to split the problem into multiple zones, and to work efficiently with MPI.
>>>
>>> As you note, this is the advantage of setting up a message passing architecture from the beginning.. It works regardless of the scale/method of message passing. There *are* differences in performance.
>>>
>>>> Ten or 15 years ago, the only way to get more compute power was to build a cluster and therefore to write the software to use MPI. This was because CPU devices had a single core (Intel Pentium 3 and 4) and because it was rare to find motherboards which handled multiple such chips.
>>>
>>> Yes
>>>
>>>> The next step would be to get a 4-socket motherboard from Tyan or SuperMicro for $800 or so and populate it with 8-, 12- or (if money permits) 16-core CPUs and a bunch of ECC RAM.
>>>>
>>>> My forthcoming music synthesis program would run fine with 8 or 16GB of RAM. So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron machines would do the trick nicely.
>
> --
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ
>
> email: j.sassmannshausen@ucl.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf