[Beowulf] X5500
Vincent Diepeveen diep at xs4all.nl
Fri Apr 3 14:11:13 PDT 2009
Hi Ellis,

First of all, most big clusters are a really expensive form of computing. Hardware becomes outdated really rapidly. Most companies sell products; they hardly do research. Research is exclusively government domain, at least in most European states. If a company does carry out research, it usually gets paid for to a large degree by subsidy. Recently, for example, I visited an organisation of the treasury department: www.senternovum.nl. They have big budgets (considering how tiny the nation is over here) to subsidize new initiatives. The largest part of that goes to BIG companies.

So the majority of clusters are really generic clusters, where obviously memory is extremely important.

Now I'm sure you would like to hear how a few companies that DO need big crunching power are doing it, but I'm not sure they would all like to see it posted here. Sometimes just the realization that you can put big calculation power into action in a specific area, where before nearly no crunching power was used, is already too much information for competitors.

The other part of the clusters is of course big military terrain. They search for the holy grail. I'm of course in the area of public holy grail search in games (computer chess), though seen from an absolute viewpoint the difference between the mathematical holy grail searchers and game tree search is not that big.

When searching for a holy grail you can of course tolerate more errors, as it is about finding that lucky shot, or about approaching the answer despite all kinds of errors.

The reason why I can tolerate errors in RAM a very little bit more than others is that I already store a CRC in the hashtables of Diep. You might call that paranoia, but that CRC checking REALLY is important. In the worst case the CRC error might also come from different cores writing to the same RAM, and of course it is TOO EXPENSIVE to have a lock. You can do without the lock by storing a simple XOR as the CRC; that's way faster and really catches a single bitflip easily.
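A minimal sketch in C of the XOR check described above. It is an illustration under stated assumptions, not Diep's actual code: the 16-byte `Entry` layout and the names `entry_checksum`, `entry_make` and `entry_is_valid` are hypothetical. The checksum is the XOR of the entry's 32-bit payload words, so a probe can reject a corrupted or half-written entry without taking a lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte hash table entry (an assumption, not Diep's layout). */
typedef struct {
    uint64_t key;    /* hash key of the position */
    uint32_t data;   /* packed payload, e.g. score, depth, best move */
    uint32_t check;  /* XOR of the three 32-bit payload words */
} Entry;

static_assert(sizeof(Entry) == 16, "entry is expected to pack into 16 bytes");

/* XOR the high key word, the low key word and the data word together. */
static uint32_t entry_checksum(uint64_t key, uint32_t data)
{
    return (uint32_t)(key >> 32) ^ (uint32_t)key ^ data;
}

/* Build an entry together with its checksum. */
static Entry entry_make(uint64_t key, uint32_t data)
{
    Entry e = { key, data, entry_checksum(key, data) };
    return e;
}

/* Returns nonzero if the stored checksum matches the recomputed one. */
static int entry_is_valid(const Entry *e)
{
    return e->check == entry_checksum(e->key, e->data);
}
```

A probe would call `entry_is_valid` and treat a failing entry as a miss. Any single flipped bit in `key`, `data` or `check` changes exactly one bit of the comparison and is caught. Two flips at the same bit position of two different 32-bit words cancel in the XOR and go undetected. An entry torn between two concurrent writers is caught unless the mixed words happen to XOR to the stored checksum.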
If I had 2 bitflips at a 32-bit interval at the same time, now *that* would be nasty, as XOR doesn't detect that.

So basically, for whatever usually runs within L1 and L2 I definitely can't tolerate errors; it would crash the application in many cases. Within the RAM that stores the hashtables, you can to some extent recover from errors.

For the holy grail searchers there are 2 different areas of course. One area is the real number crunching type, where everything is embarrassingly parallel. There are usually no big RAM requirements, so everything runs exclusively within L1/L2. These guys are really power hungry, and most of them won't have what you would even be able to call a cluster. It's just some sort of specialized monster machine with specially programmed or specially designed hardware. If you calculate the effective number of gflops per dollar that these guys get with today's GPUs, that's of course really cheap compared to the classical definition of a cluster.

Yet again, all these classical clusters simply have an ECC requirement. Do you know 1 researcher who redoes an application and also gets a grant to do the research a second time, in order to check whether some calculation mistake of the hardware messed things up?

Not at all. There are good examples of a round-off error produced by old clusters that gave a result differing from existing quantum mechanics, which was then explained by adding a new theory to it. Of course it was refuted 30 years later by someone who FINALLY did the recalculation in a correct manner, did not get that round-off error, and concluded that the result he saw was for a change a CORRECT result, and that the error the others had put into quantum mechanics theory was caused by the amateurism of a whole generation of researchers, most of them seen by society as really clever.

You can easily fool the best researcher with hardware, simply because hardware is not necessarily his expertise, nor does he expect it to make a mistake.
If you're going to calculate on hundreds of cores, you are sure to get some bitflips in RAM. ECC is a requirement then. They don't have 15 years of time, like I had for my chess software, to build in a CRC check themselves; I had to, as of course the majority of users' machines don't have ECC memory.

Vincent

On Apr 3, 2009, at 6:05 AM, Ellis Wilson wrote:

> Vincent Diepeveen wrote:
>> Bill,
>>
>> the ONLY price that matters is that of ECC ram when posting in a
>> cluster group.
>>
>> If there is 1 commission that EVER puts a signature underneath a
>> production cluster without ECC ram using x86 processors (gpu's is
>> yet another new thing that is interesting to discuss), then please
>> inform me, as they qualify for a full and thorough investigation
>> by a range of shrinks and psychologists, on how group behaviour
>> could lead to such a total unqualified and naive and total wrong
>> decision; resulting of course in the direct firing of the entire
>> commission and decommissioning them to north part of Norway where
>> they can count the number of iceblocks they see afloat, this for
>> the rest of their life until retirement age.
>>
>> So in short i can completely ignore your posting.
>>
>> ECC is a requirement, not a luxury.
>
> Though entertainingly put, it would be an error to say "ECC is a
> requirement" for everyone in a "cluster group". I can think of more
> than just a few purposes for clusters that truly do not require the
> accuracy guaranteed by ECC RAM.
>
> Actually as far as errors of the grossest nature go, the only truly
> bad one to make on this list is to take something that is true for
> one sector of clustering and apply it to the whole. Now thats just
> dumping oil on the torches.
>
> Ellis
