[Beowulf] [tt] One million ARM chips challenge Intel bumblebee

Thu Jul 7 12:38:34 PDT 2011

On 07/07/2011 02:26 PM, Lux, Jim (337C) wrote:
>>> It's all about ultimate scalability.  Anybody with a moderate competence (certainly anyone on this
>>> list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work
>>> in unit time.  It's substantially more challenging to devise a scheme to do 1000 quanta of work in
>>> unit time on, say, 1500 processors with a 20% failure rate.  Or even in 1.2*unit time.
>>>
>>
>> Just to be clear - I wasn't saying this was a bad idea. Scaling up to
>> this size seems inevitable. I was just imagining the team of admins who
>> would have to be working non-stop to replace dead processors!
>>
>> I wonder what the architecture for this system will be like. I imagine
>> it will be built around small multi-socket blades that are hot-swappable
>> to handle this.
> 
> I think that you just anticipate the failures and deal with them.  It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool".
> 
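
As a rough check on the 1500-processor example quoted above, here is a
minimal Python sketch. It assumes "20% failure rate" means each processor
independently has a 0.2 chance of being dead during the unit of time, and
that any surviving processor can take over any quantum of work. Both are
my assumptions, not something stated in the thread, and the function names
are made up for illustration. Under those assumptions about 1200 of the
1500 processors survive on average, and the sketch estimates how likely it
is that at least 1000 do.

```python
import math

def log_binom_pmf(n, k, p):
    """Log of the binomial probability of exactly k successes in n trials.

    Computed in log space because the binomial coefficients for n = 1500
    are far too large for a float. Valid for 0 < p < 1.
    """
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def prob_enough_survivors(n, need, p_fail):
    """Probability that at least `need` of `n` processors are still alive.

    Assumption (hypothetical model): failures are independent, each with
    probability p_fail, where 0 < p_fail < 1.
    """
    p_live = 1.0 - p_fail
    return sum(math.exp(log_binom_pmf(n, k, p_live))
               for k in range(need, n + 1))

if __name__ == "__main__":
    # 1500 processors, 1000 quanta of work, 20% failure rate
    print(prob_enough_survivors(1500, 1000, 0.2))
    # no spares at all: every one of 1000 processors has to survive
    print(prob_enough_survivors(1000, 1000, 0.2))
```

With 500 spares the shortfall case is vanishingly unlikely under this
model, while with no spares it is all but certain. The hard part, as the
quoted text says, is the scheme that notices a dead processor and hands
its work to a live one, which this sketch does not attempt to model.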

Did you read the paper that someone else posted a link to? I just read
the first half of it. A good part of this research is focused on
fault-tolerance/resiliency of computer systems. They're not just
interested in creating a computer to mimic the brain, they want to learn
how to mimic the brain's fault-tolerance in computers.

To paraphrase the paper, we lose a neuron a second in our brains for our
entire lives, but we never notice any problems from that. This research
hopes to learn how to duplicate that with this computer, so you could
say hardware failures are desirable and necessary for this research.

Prentice