[Beowulf] A cluster of Arduinos
deadline at eadline.org
Fri Jan 13 07:18:02 PST 2012
> -----Original Message-----
> From: Douglas Eadline [mailto:deadline at eadline.org]
> Sent: Thursday, January 12, 2012 8:49 AM
> To: Lux, Jim (337C)
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] A cluster of Arduinos
>> For my own work, I'd rather have people who are interested in solving
>> problems by ganging up multiple failure prone processors, rather than
>> centralizing it all in one monolithic box (even if the box happens to
>> have multiple cores).
> This is going to be an exascale issue, i.e. how to compute on systems
> whose parts might be in a constant state of breaking. Another interesting
> question is how do you know you are getting the right answer on a *really*
> large system?
> Of course I spend much of my time optimizing really small systems.
> Your point about scaling is well taken.. so far, the computing world has
> largely dealt with things by trying to make the processor perfect and
> error free. Some limited areas of error correction are popular (RAM).
> But think in a bigger area... say your arithmetic unit has some infrequent
> unknown errors (e.g. FDIV bug on Pentium).. could clever algorithm design
> and multiple processors (or multi cores) mitigate this (e.g. instead of
> just computing Z = X/Y you also compute Z1 = (X*2)/(Y*2).. and compare
> answers... that exact example's not great because you've added 2
> operations, but I can see that there are other clever techniques that
> might be possible.. )
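
[Editor's sketch, not part of the original message.] The scaled-operand check described above could look roughly like the following. The names `checked_divide` and `faulty_divide` are made up for illustration, and the "fault" is injected by hand; it is not the real Pentium FDIV behaviour. The sketch relies on multiplication by 2 being exact in binary floating point (barring overflow), so a correct divider should return bit-identical results for both forms.

```python
def checked_divide(x, y, divide=lambda a, b: a / b):
    """Divide twice with differently scaled operands and compare the results."""
    z = divide(x, y)
    z1 = divide(x * 2, y * 2)  # scaling by 2 is exact, so z1 should equal z
    if z != z1:
        raise ArithmeticError(f"divider disagreement: {z!r} vs {z1!r}")
    return z


def faulty_divide(a, b):
    """Hypothetical divider that is wrong for one operand pair (simulated fault)."""
    if (a, b) == (4195835.0, 3145727.0):
        return a / b - 1e-4  # injected error, not the actual Pentium result
    return a / b


if __name__ == "__main__":
    print(checked_divide(1.0, 3.0))
    try:
        checked_divide(4195835.0, 3145727.0, divide=faulty_divide)
    except ArithmeticError as err:
        print("caught:", err)
```

As the message notes, this doubles the division count and adds two multiplies, so it is only an illustration of the idea, not an efficient check.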
> What is nice is if you can do things like temporal redundancy (do the
> calculation twice, and if it's different, do it a third time), or even
> better some sort of "check calculation" that takes small time compared to
> mainline calculation.
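
[Editor's sketch, not part of the original message.] A minimal version of the "do it twice, and a third time on disagreement" scheme, assuming the calculation is deterministic when the hardware behaves and that results can be compared with `==`. The function name `run_with_retry` is hypothetical.

```python
def run_with_retry(compute, *args):
    """Run compute twice; on disagreement run a third time and take the majority."""
    first = compute(*args)
    second = compute(*args)
    if first == second:
        return first
    third = compute(*args)  # tie-breaker run
    if third == first or third == second:
        return third
    raise RuntimeError("no two of the three runs agreed")


if __name__ == "__main__":
    # Simulated transient fault: the first run returns a wrong value.
    results = iter([41, 42, 42])
    print(run_with_retry(lambda: next(results)))
```

The cost is at least 2x the mainline calculation, which is why a cheap "check calculation" is the more attractive option when one exists.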
> This, I think, is somewhere that even the big iron/cluster folks could be
> doing some research. What are optimum communication fabrics to support
> this kind of "side calculation" which may have different communication
> patterns and data flow than the "mainline". It has a parallel in things
> like CRC checks in communications protocols. A lot of hardware has a
> dedicated little CRC checker that is continuously calculating the CRC as
> the bits arrive, so that when you get to the end of the frame, the answer
> is already there.
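
[Editor's sketch, not part of the original message.] The "CRC is already there when the frame ends" idea can be shown in software with the running-value argument of Python's `zlib.crc32`. The `StreamingCrc` class is a made-up wrapper for illustration; real hardware does this bit by bit in a shift register rather than chunk by chunk.

```python
import zlib


class StreamingCrc:
    """Keep a running CRC-32 that is updated as each chunk of a frame arrives."""

    def __init__(self):
        self.value = 0

    def feed(self, chunk: bytes) -> None:
        # zlib.crc32 accepts the previous CRC as its second argument.
        self.value = zlib.crc32(chunk, self.value)

    def matches(self, expected: int) -> bool:
        return self.value == expected


if __name__ == "__main__":
    crc = StreamingCrc()
    for chunk in (b"1234", b"567", b"89"):  # chunks arriving over time
        crc.feed(chunk)
    # No extra pass over the frame is needed at the end.
    print(crc.matches(zlib.crc32(b"123456789")))
```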
> And Doug, your small systems have a lot of the same issues, perhaps
> because that small Limulus might be operated in environments other than
> what the underlying hardware was designed for. I know people who have
> been rudely surprised when they found that the design environment for a
> laptop is a pretty narrow temperature range (e.g. office desktop) and when
> they put them in a car, subject to 0C or 40C temperatures, if not wider,
> that things don't work quite as well as expected.
I will be curious to see where these things show up since
all you really need is a power plug. (a little nervous actually).
> Very small systems (few nodes) have the same issues in some environments
> (e.g. a cluster subject to single event upsets or functional interrupts in
> a high radiation environment with a lot of high energy charged particles;
> it's not so much a total dose thing, but a SEE (single event effect) thing).
> For Juno (which is in polar orbit around Jupiter), we shielded everything
> in a vault (a 1 meter cube with 1cm thick titanium walls) and still it's
> an issue. We don't get very long before everything is cooked.
> And I think that with a non-trivially small cluster (e.g. more than 4 nodes,
> I think) you could do a lot of experimentation on techniques.
I agree. Four nodes is really small. BTW, the most fun in designing
this system is a set of tighter constraints than are found on the typical
cluster. Noise, power, space, cabling, low cost packaging, etc. I have
been asked about a rack mount version, we'll see.
One thing I find interesting is the core/node efficiency.
(what I call "effective cores") In general *on some codes*, I found that
fewer cores (1P micro-ATX, 4 cores) are more efficient than many
cores (2P server, 12 cores). Seems obvious, but I like to test things.
> (oddly, simulated fault injection is one of the trickier parts)
I would assume, because in a sense, the black swan* is
by definition hard to predict.
(* the book by Nick Taleb, not the movie)
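
[Editor's sketch, not part of the original message.] One very simple form of simulated fault injection is flipping a single bit in a stored value, loosely imitating a single event upset. The helper `flip_bit` is hypothetical and only covers IEEE 754 doubles; it says nothing about which bits a real upset would hit or how often, which is the hard-to-predict part.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE 754 representation inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))  # float -> 64-bit integer
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    print(flip_bit(1.0, 63))  # bit 63 is the sign bit
    print(flip_bit(1.0, 52))  # bit 52 is the lowest exponent bit
```

Feeding values corrupted this way into a redundancy or check calculation is one way to see whether the check actually notices the error.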