[Beowulf] X5500
Vincent Diepeveen
diep at xs4all.nl
Fri Apr 3 14:11:13 PDT 2009
Hi Ellis,
First of all, most big clusters are a really expensive form of computing.
The hardware becomes outdated really rapidly.
Most companies sell products; they hardly do research.
Research is exclusively the government's domain, at least in most
European states.
If a company does carry out research, it is usually paid for to a large
degree by subsidy.
Recently, for example, I visited a treasury department organisation:
www.senternovum.nl
They have big budgets (considering how tiny the nation is over here) to
subsidize new initiatives. The largest part of that goes to BIG companies.
So the majority of clusters are really generic clusters, where obviously
memory is extremely important.
Now I'm sure you would like to hear how a few companies that DO need big
crunching power are doing it, but I'm not sure they would all like to see
it posted here.
Sometimes the mere realization that you can put big calculation power
into action in a specific area, where before nearly no crunching power
was used, is already too much information for competitors.
The other part of the clusters is of course big military terrain.
They search for the holy grail.
I'm of course in the area of public holy grail search in games (computer
chess), though seen from an absolute viewpoint the difference between the
mathematical holy grail searchers and game tree search is not that big.
When searching for a holy grail you can of course tolerate more errors,
as it is about finding that lucky shot, or approaching it despite all
kinds of errors.
The reason why I can tolerate errors in RAM a little bit more than others
is that I already store a CRC in the hashtables of Diep.
You might call that paranoia, but that CRC checking REALLY is important.
In the worst case the CRC error might also come from different cores
writing to the same RAM, and of course it is TOO EXPENSIVE to have a lock.
You can avoid the lock by computing the CRC with a simple XOR; that's way
faster and easily catches a single bitflip.
If I had 2 bitflips at a 32-bit interval at the same time, now *that*
would be nasty, as XOR doesn't detect that.
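To illustrate the XOR check described above, here is a minimal sketch in
Python. It is not Diep's actual code: the entry layout (a 64-bit key, a
64-bit data word and a 32-bit check value) and the names xor_fold32,
store, probe and flip_bit are assumptions made for this example only.

```python
MASK32 = 0xFFFFFFFF

def xor_fold32(*words):
    """XOR together every 32-bit half of the given 64-bit words."""
    check = 0
    for w in words:
        check ^= (w & MASK32) ^ ((w >> 32) & MASK32)
    return check

def store(table, key, data):
    """Write an entry without any lock; the check value travels with it."""
    table[key % len(table)] = [key, data, xor_fold32(key, data)]

def probe(table, key):
    """Return the stored data, or None if the slot is empty, belongs to
    another key, or fails the XOR check (bitflip or torn write)."""
    entry = table[key % len(table)]
    if entry is None:
        return None
    stored_key, data, check = entry
    if stored_key != key or xor_fold32(stored_key, data) != check:
        return None
    return data

def flip_bit(word, bit):
    """Simulate a RAM bitflip at the given bit position."""
    return word ^ (1 << bit)

# Hypothetical table size and entry contents, chosen for illustration.
table = [None] * 1024
key, data = 0x0123456789ABCDEF, 0x00000000DEADBEEF
store(table, key, data)
slot = table[key % len(table)]

# One flipped bit changes the folded value, so the entry is rejected.
slot[1] = flip_bit(data, 5)
single_flip_detected = probe(table, key) is None

# Two flips exactly 32 bits apart cancel in the fold and slip through.
slot[1] = flip_bit(flip_bit(data, 5), 37)
double_flip_missed = probe(table, key) is not None
```

A rejected entry is simply treated as a miss and searched again, which is
why no lock is needed. The last two lines show the weak spot mentioned
above: bits 5 and 37 land on the same position of the 32-bit fold, so the
corrupted entry is accepted.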
So basically, in what usually runs within L1 and L2 I definitely can't
tolerate errors; it would crash the application in many cases. Within the
RAM that stores the hashtables, you can to some extent recover from
errors.
For the holy grail searchers there are 2 different areas of course.
One area is the real number crunching type, where everything is
embarrassingly parallel.
The RAM requirements are usually not too big, so everything runs
exclusively within L1/L2.
These guys are really power hungry, and most of them won't have what you
would even be able to call a cluster. It's just some sort of specialized
monster machine with specially programmed or specially designed hardware.
If you calculate the effective number of gflops per dollar that these
guys get with today's GPUs, that's of course really cheap compared to the
classical definition of a cluster.
Yet all these classical clusters simply have an ECC requirement.
Do you know 1 researcher who reruns an application, and also gets a grant
to do the research a second time, in order to check whether some
calculation mistake of the hardware messed things up?
Not at all. There are good examples of a round-off error produced by old
clusters that gave a result differing from existing quantum mechanics,
which was then explained by adding a new theory to it.
Of course this was refuted 30 years later by someone who FINALLY did the
recalculation in a correct manner, didn't get that round-off error, and
concluded that the result he saw was for a change a CORRECT result, and
that the error the others had in the quantum mechanics theory was caused
by the amateurism of a whole generation of researchers, most of them seen
by society as really clever.
You can easily fool even the best researcher with hardware, simply
because hardware is not necessarily his expertise and he doesn't expect
it to make a mistake.
If you're going to calculate on hundreds of cores, you are sure to get
some bitflips in RAM.
ECC is a requirement then.
They don't have 15 years of time, like I had for my chess software, to
build in a CRC check themselves, as of course the majority of 'user
machines' don't have ECC memory.
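The scaling argument can be sketched as simple arithmetic, here in
Python. Everything in it is an assumption for illustration: bitflips are
modelled as independent events at a constant rate, and the
flips_per_gib_hour parameter is a placeholder the reader has to fill in
from their own measurements, not a figure taken from this thread.

```python
import math

def expected_bitflips(nodes, gib_per_node, hours, flips_per_gib_hour):
    """Expected number of bitflips over a whole run, assuming the rate
    scales linearly with total memory and with run time."""
    return nodes * gib_per_node * hours * flips_per_gib_hour

def prob_at_least_one(expected):
    """Chance of one or more flips, assuming a Poisson process."""
    return 1.0 - math.exp(-expected)
```

Whatever rate is plugged in, the expected count grows linearly with the
number of nodes and the length of the run, so the chance of a clean run
shrinks exponentially as the job gets bigger.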
Vincent
On Apr 3, 2009, at 6:05 AM, Ellis Wilson wrote:
>
> Vincent Diepeveen wrote:
>> Bill,
>>
>> the ONLY price that matters is that of ECC ram when posting in a
>> cluster group.
>>
>> If there is 1 commission that EVER puts a signature underneath a
>> production cluster without ECC ram using x86 processors (gpu's is yet
>> another new thing that is interesting to discuss), then please inform
>> me, as they qualify for a full and thorough investigation by a range of
>> shrinks and psychologists, on how group behaviour could lead to such a
>> total unqualified and naive and total wrong decision; resulting of
>> course in the direct firing of the entire commission and
>> decommissioning them to north part of Norway where they can count the
>> number of iceblocks they see afloat, this for the rest of their life
>> until retirement age.
>>
>> So in short i can completely ignore your posting.
>>
>> ECC is a requirement, not a luxury.
>
> Though entertainingly put, it would be an error to say "ECC is a
> requirement" for everyone in a "cluster group". I can think of more
> than just a few purposes for clusters that truly do not require the
> accuracy guaranteed by ECC RAM.
>
> Actually as far as errors of the grossest nature go, the only truly bad
> one to make on this list is to take something that is true for one
> sector of clustering and apply it to the whole. Now that's just dumping
> oil on the torches.
>
> Ellis