[Beowulf] A cluster of Arduinos

Thu Jan 12 09:54:52 PST 2012

-----Original Message-----
From: Douglas Eadline [mailto:deadline at eadline.org] 
Sent: Thursday, January 12, 2012 8:49 AM
To: Lux, Jim (337C)
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] A cluster of Arduinos

snip
>
>
> For my own work, I'd rather have people who are interested in solving 
> problems by ganging up multiple failure prone processors, rather than 
> centralizing it all in one monolithic box (even if the box happens to 
> have multiple cores).
>

This is going to be an exascale issue, i.e. how to compute on a system whose parts might be in a constant state of breaking. Another interesting question is how do you know you are getting the right answer on a *really* large system?

Of course I spend much of my time optimizing really small systems.

--

Your point about scaling is well taken. So far, the computing world has largely dealt with things by trying to make the processor perfect and error free. Some limited areas of error correction are popular (RAM). But think in a bigger area: say your arithmetic unit has some infrequent unknown errors (e.g. the FDIV bug on Pentium). Could clever algorithm design and multiple processors (or multiple cores) mitigate this? For example, instead of just computing Z = X/Y you also compute Z1 = (X*2)/(Y*2) and compare answers. That exact example's not great because you've added 2 operations, but I can see that there are other clever techniques that might be possible.
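A minimal sketch of that compare-two-divisions idea, in Python. The name `checked_div`, the `divide` hook (standing in for a possibly faulty divider) and the tolerance are all assumptions made for illustration. One deliberate change from the example above: the sketch scales by 3 rather than 2, because in binary floating point multiplying by 2 only changes the exponent, so the divider would see the same mantissa bits both times.

```python
import math

def checked_div(x, y, divide=lambda a, b: a / b, scale=3.0, rel_tol=1e-12):
    """Compute x / y twice, the second time with both operands scaled.

    `divide` is a hypothetical hook standing in for the hardware divider.
    Scaling by 3 introduces a little rounding, so the two quotients are
    compared within a relative tolerance rather than bit for bit.
    """
    z = divide(x, y)
    z1 = divide(x * scale, y * scale)
    if not math.isclose(z, z1, rel_tol=rel_tol):
        raise ArithmeticError(f"division check failed: {z!r} vs {z1!r}")
    return z
```

Calling `checked_div(1.0, 4.0)` returns 0.25; a divider that is wrong for one particular operand pattern gets caught because the scaled operands miss that pattern.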

What is nice is if you can do things like temporal redundancy (do the calculation twice, and if the results differ, do it a third time), or even better some sort of "check calculation" that takes a small amount of time compared to the mainline calculation.
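Both ideas can be sketched in a few lines of Python. The function names here are made up for illustration, and the square root is just one convenient case where the check (one multiply) is much cheaper than the mainline calculation.

```python
import math

def run_with_retry(compute, *args):
    """Temporal redundancy: run twice, and run a third time only on mismatch."""
    first = compute(*args)
    second = compute(*args)
    if first == second:
        return first
    third = compute(*args)  # tie-breaker run
    if third == first or third == second:
        return third
    raise RuntimeError("no two runs agreed")

def checked_sqrt(x, rel_tol=1e-9):
    """Check calculation: verify a square root by squaring the result."""
    r = math.sqrt(x)
    if not math.isclose(r * r, x, rel_tol=rel_tol):
        raise ArithmeticError("square root failed its check calculation")
    return r
```

The retry version costs at least 2x in time on every call, which is why a cheap check calculation is the more attractive option when one exists.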

This, I think, is somewhere that even the big iron/cluster folks could be doing some research.  What are the optimum communication fabrics to support this kind of "side calculation", which may have different communication patterns and data flow than the "mainline"?  It has a parallel in things like CRC checks in communications protocols.  A lot of hardware has a dedicated little CRC checker that is continuously calculating the CRC as the bits arrive, so that when you get to the end of the frame, the answer is already there.  
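A software analogue of that running CRC, as a rough sketch: Python's `zlib.crc32` accepts a running value, so the checksum can be updated as each chunk arrives. The `receive_frame` name and the chunked input are assumptions for illustration; real hardware does this bit by bit.

```python
import zlib

def receive_frame(chunks):
    """Accumulate a frame and its CRC-32 as the chunks arrive."""
    crc = 0
    payload = bytearray()
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # update the running CRC with this chunk
        payload.extend(chunk)
    # At end of frame the CRC is already available; no second pass is needed.
    return bytes(payload), crc
```

The running value matches a one-shot `zlib.crc32` over the whole frame, so the receiver only has to compare it with the CRC the sender transmitted.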

And Doug, your small systems have a lot of the same issues, perhaps because that small Limulus might be operated in environments other than what the underlying hardware was designed for.  I know people who have been rudely surprised when they found that the design environment for a laptop is a pretty narrow temperature range (e.g. office desktop), and that when they put one in a car, subject to 0C or 40C temperatures, if not wider, things don't work quite as well as expected.

Very small systems (few nodes) have the same issues in some environments, e.g. a cluster subject to single event upsets or functional interrupts in a high radiation environment with a lot of high energy charged particles. It's not so much a total dose thing as a single event effects (SEE) thing.

For Juno (which will be in polar orbit around Jupiter), we shielded everything in a vault (a 1 meter cube with 1cm thick titanium walls) and still it's an issue.  We don't get very long before everything is cooked. 

And I think that with a non-trivially small cluster (e.g. more than 4 nodes) you could do a lot of experimentation on techniques.

(oddly, simulated fault injection is one of the trickier parts)
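
For what it's worth, the crudest form of simulated fault injection is easy to sketch in Python: wrap a function and occasionally flip one bit of its floating point result. Everything here (the names, the single bit flip model, the uniform fault rate) is an assumption for illustration; the tricky part in practice is making injected faults resemble what real hardware does, which this does not attempt.

```python
import random
import struct

def flip_bit(value, bit):
    """Flip one bit (0..63) of a float's IEEE 754 double representation."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

def inject_faults(func, rate, rng):
    """Wrap func so that a fraction `rate` of its results get one bit flipped."""
    def wrapper(*args):
        result = func(*args)
        if rng.random() < rate:
            return flip_bit(result, rng.randrange(64))
        return result
    return wrapper
```

Passing a seeded `random.Random` as `rng` makes a fault campaign repeatable, which helps when comparing detection techniques.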