[Beowulf] X5500

Fri Apr 3 14:11:13 PDT 2009

Hi Ellis,

First of all, most big clusters are a really expensive form of computing.
Hardware becomes outdated really rapidly.

Most companies sell products; they hardly do research.

Research is almost exclusively the domain of government, at least in
most European states.
If a company does carry out research, it is usually paid for to a
large degree by subsidy.

Recently, for example, I was at an organisation of the treasury department:
www.senternovum.nl

They've got big budgets (considering how tiny the nation is over here)
to subsidize new initiatives. The largest part of that goes to BIG companies.

So the majority of clusters are truly generic clusters, where memory is
obviously extremely important.

Now I'm sure you'd like to hear how a few companies that DO need big
crunching power are doing it, but I'm not sure they would all like to see
it posted here.

Sometimes the mere realization that you can put big calculation power into
action in a specific area, where before nearly no crunching power was used,
is already too much information for competitors.

The other part of the clusters is of course the big military terrain.
They search for the holy grail.
I'm of course in the area of public holy grail search in games
(computer chess), though the difference between the mathematical holy grail
searchers and game tree search is not that big, seen from an absolute
viewpoint.

When searching for a holy grail, you can of course tolerate more errors,
as it is about finding that lucky shot, or about approaching it with all
kinds of errors.

The reason why I can tolerate errors in RAM a little bit more than others
is that I already store a CRC in the hashtables of Diep.

You might call that paranoia, but that CRC checking REALLY is important.

In the worst case the CRC error might also come from different cores
writing to the same RAM at once, and of course it is TOO EXPENSIVE to take
a lock. You can avoid the lock by storing the CRC as a simple XOR; that's
way faster and easily catches a single bitflip.
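
As a rough illustration of the idea (not Diep's actual code), the sketch
below stores a 32-bit XOR checksum next to each hashtable entry and rejects
any entry whose checksum no longer matches on probe. The two-word entry
layout and the names fold32, make_entry and probe are assumptions made up
for this example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def fold32(word):
    """XOR the upper and lower 32-bit halves of a 64-bit word."""
    return ((word >> 32) ^ word) & MASK32


def make_entry(key, data):
    """Build an entry: two 64-bit words plus a 32-bit XOR checksum."""
    key &= MASK64
    data &= MASK64
    return {"key": key, "data": data, "check": fold32(key) ^ fold32(data)}


def probe(entry, key):
    """Return the data if the entry is intact and matches key, else None."""
    if fold32(entry["key"]) ^ fold32(entry["data"]) != entry["check"]:
        return None  # bitflip or torn write: treat it as a miss
    if entry["key"] != key:
        return None  # a different position lives in this slot
    return entry["data"]


# Intact entry: the probe hits.
entry = make_entry(0x0123456789ABCDEF, 0x00000000DEADBEEF)
hit = probe(entry, 0x0123456789ABCDEF)

# Single bitflip in the data word: the checksum no longer matches.
flipped = dict(entry, data=entry["data"] ^ (1 << 17))
after_flip = probe(flipped, 0x0123456789ABCDEF)

# Torn write without a lock: key word from one writer, data word and
# checksum from another writer.
other = make_entry(0x1111111122222222, 0x0000000012345678)
torn = {"key": entry["key"], "data": other["data"], "check": other["check"]}
after_tear = probe(torn, 0x0123456789ABCDEF)
```

A rejected entry only costs a re-search of that position, which is why this
kind of error is tolerable in the hashtable. Note that a torn write is only
caught when the two writers' checksums differ, so this is a cheap filter and
not a guarantee.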

If I had 2 bitflips at a 32-bit interval at the same time, now *that*
would be nasty, as XOR doesn't detect that.
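
To see why that case is nasty, assume the checksum folds each 64-bit word
into 32 bits by XORing its upper and lower halves (an assumption for
illustration; the exact layout in Diep is not spelled out here). Two flipped
bits exactly 32 positions apart then cancel each other out:

```python
MASK32 = 0xFFFFFFFF


def fold32(word):
    """XOR the upper and lower 32-bit halves of a 64-bit word."""
    return ((word >> 32) ^ word) & MASK32


word = 0x0123456789ABCDEF
good = fold32(word)

# One flipped bit changes exactly one bit of the folded checksum.
one_flip = word ^ (1 << 5)
single_detected = fold32(one_flip) != good

# Bits 5 and 37 fold onto the same checksum bit, so the two flips cancel.
two_flips = word ^ (1 << 5) ^ (1 << 37)
double_detected = fold32(two_flips) != good
```

Here single_detected comes out true and double_detected comes out false:
the corrupted word passes the check.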

So basically, for what usually runs within L1 and L2 I definitely can't
tolerate errors; it would crash the application in many cases. Within the
RAM that stores the hashtables, you can to some extent recover from errors.

For the holy grail searchers, there are of course 2 different areas.

One area is the real number crunching types, where everything is
embarrassingly parallel.
The RAM requirements are usually not too big, so everything runs
exclusively within L1/L2.
These guys really are power hungry, and most of them won't have anything
you would even be able to call a cluster. It's just some sort of
specialized monster machine with specially programmed or specially
designed hardware.

If you effectively calculate the number of gflops per dollar that these
guys get with today's GPUs, that's of course really cheap compared to the
classical definition of a cluster.

Then again, all these classical clusters simply have an ECC requirement.

Do you know 1 researcher who redoes an application, and who also gets a
grant to do the research a second time, in order to check whether some
calculation mistake of the hardware messed things up?

Not at all. There are good examples of a round-off error produced by old
clusters that gave some difference from existing quantum mechanics, which
was then explained by adding a new theory to it.

Of course it was refuted 30 years later by someone who FINALLY did do the
recalculation in a correct manner, didn't get that round-off error, and
concluded that the result he saw was for a change a CORRECT result, and
that the error the others had put into the quantum mechanics theory was
caused by the amateurism of a whole generation of researchers, most of
them seen by society as really clever.

You can easily fool even the best researcher with hardware, simply because
hardware is not necessarily his expertise, and a hardware mistake is not
what he expects.

If you're going to calculate on hundreds of cores, you are sure to get some
bitflips in RAM.

ECC is a requirement then.
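
A back-of-the-envelope sketch of why scale matters: if bitflips are modeled
as independent rare events (a Poisson model, which is an assumption), the
chance of at least one flip grows quickly with total memory and run time.
The function name and the rate used below are hypothetical placeholders,
not measurements; plug in a rate you trust for your own hardware.

```python
import math


def flip_estimate(flips_per_gb_hour, gb_per_node, nodes, hours):
    """Return (expected flips, probability of at least one flip).

    Assumes independent flips at a constant rate (Poisson model).
    """
    expected = flips_per_gb_hour * gb_per_node * nodes * hours
    return expected, 1.0 - math.exp(-expected)


# HYPOTHETICAL rate, chosen only to make the arithmetic visible.
ASSUMED_RATE = 1e-4  # flips per GB per hour, NOT a measured value

# One 16 GB machine running for one day.
single_expected, single_prob = flip_estimate(ASSUMED_RATE, 16, 1, 24)

# One hundred such machines running for thirty days.
cluster_expected, cluster_prob = flip_estimate(ASSUMED_RATE, 16, 100, 24 * 30)
```

Under that assumed rate the single machine will most likely see no flip in
a day, while the cluster run expects on the order of a hundred. Whatever
the true rate is, the expected count scales linearly with nodes and hours.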

They don't have 15 years of time, like I had for my chess software, to
build in a CRC check themselves. I needed one because of course the
majority of users' machines don't have ECC memory.

Vincent

On Apr 3, 2009, at 6:05 AM, Ellis Wilson wrote:

>
> Vincent Diepeveen wrote:
>> Bill,
>>
>> the ONLY price that matters is that of ECC ram when posting in a cluster
>> group.
>>
>>
>> If there is 1 commission that EVER puts a signature underneath a
>> production cluster
>> without ECC ram using x86 processors (gpu's is yet another new thing
>> that is interesting
>> to discuss), then please inform me, as they qualify for a full and
>> thorough investigation
>> by a range of shrinks and psychologists, on how group behaviour could
>> lead to such a
>> total unqualified and naive and total wrong decision; resulting of
>> course in the direct
>> firing of the entire commission and decommissioning them to the north part
>> of Norway where they can count the number of iceblocks they see afloat,
>> this for the rest of their life until retirement age.
>>
>> So in short i can completely ignore your posting.
>>
>> ECC is a requirement, not a luxury.
>
> Though entertainingly put, it would be an error to say "ECC is a
> requirement" for everyone in a "cluster group".  I can think of more
> than just a few purposes for clusters that truly do not require the
> accuracy guaranteed by ECC RAM.
>
> Actually as far as errors of the grossest nature go, the only truly bad
> one to make on this list is to take something that is true for one
> sector of clustering and apply it to the whole.  Now that's just dumping
> oil on the torches.
>
> Ellis