[Beowulf] A cluster of Arduinos

Fri Jan 13 07:18:02 PST 2012

>
>
> -----Original Message-----
> From: Douglas Eadline [mailto:deadline at eadline.org]
> Sent: Thursday, January 12, 2012 8:49 AM
> To: Lux, Jim (337C)
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] A cluster of Arduinos
>
> snip
>>
>>
>> For my own work, I'd rather have people who are interested in solving
>> problems by ganging up multiple failure prone processors, rather than
>> centralizing it all in one monolithic box (even if the box happens to
>> have multiple cores).
>>
>
> This is going to be an exascale issue, i.e. how to compute on a system
> whose parts might be in a constant state of breaking. Another interesting
> question is how do you know you are getting the right answer on a *really*
> large system?
>
> Of course I spend much of my time optimizing really small systems.
>
> --
>
> Your point about scaling is well taken.. so far, the computing world has
> largely dealt with things by trying to make the processor perfect and
> error free.  Some limited areas of error correction are popular (RAM).
> But think in a bigger area... say your arithmetic unit has some infrequent
> unknown errors (e.g. FDIV bug on Pentium).. could clever algorithm design
> and multiple processors (or multi cores) mitigate this (e.g. instead of
> just computing  Z = X/Y you also compute Z1 = (X*2)/(Y*2).. and compare
> answers... that exact example's not great because you've added 2
> operations, but I can see that there are other clever techniques that
> might be possible.. )
>
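A minimal sketch of one such check, in Python. This is a different check from the (X*2)/(Y*2) idea: it multiplies the quotient back and compares with X. The function name, the `divide` hook and the tolerance are assumptions made up for illustration, and the check costs a multiply and a compare per divide, so it is not free either.

```python
import math
import operator
from typing import Callable


def checked_divide(
    x: float,
    y: float,
    divide: Callable[[float, float], float] = operator.truediv,
    rel_tol: float = 1e-12,
) -> float:
    """Compute x / y, then verify it by multiplying back.

    `divide` is a hypothetical hook so a faulty divider can be swapped in
    for testing; `rel_tol` is an assumed tolerance, loose enough for normal
    rounding but far tighter than a gross arithmetic fault.
    """
    z = divide(x, y)
    # Check calculation: z * y should reproduce x to within rounding.
    if not math.isclose(z * y, x, rel_tol=rel_tol, abs_tol=0.0):
        raise ArithmeticError(f"divide check failed for {x!r} / {y!r}")
    return z
```

A divider that is off by about one part in 100,000 would be caught; results that underflow to zero would raise a false alarm, so a real version needs more care.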
> What is nice is if you can do things like temporal redundancy (do the
> calculation twice, and if it's different, do it a third time), or even
> better some sort of "check calculation" that takes small time compared to
> mainline calculation.
>
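The "do it twice, and a third time if they differ" idea can be sketched roughly as follows. The names `compute_with_retry` and `make_flaky` are hypothetical, and the sketch assumes results can be compared for exact equality and that faults are transient (a repeatable fault gives the same wrong answer twice and goes undetected).

```python
from typing import Callable, Set, TypeVar

T = TypeVar("T")


def compute_with_retry(calc: Callable[[], T]) -> T:
    """Temporal redundancy: run calc twice, and a third time on mismatch."""
    first = calc()
    second = calc()
    if first == second:
        return first
    # Disagreement: a third run acts as the tie-breaker.
    third = calc()
    if third == first or third == second:
        return third
    raise RuntimeError("no two of three results agree")


def make_flaky(bad_calls: Set[int]) -> Callable[[], int]:
    """Hypothetical fault injector: wrong answer on the listed call numbers."""
    state = {"n": 0}

    def calc() -> int:
        state["n"] += 1
        # A faulty call returns a distinct wrong value; otherwise 42.
        return -state["n"] if state["n"] in bad_calls else 42

    return calc
```

For example, `compute_with_retry(make_flaky({2}))` returns 42 after three runs, at the cost of 2x the work in the common case.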
> This, I think, is somewhere that even the big iron/cluster folks could be
> doing some research.  What are optimum communication fabrics to support
> this kind of "side calculation" which may have different communication
> patterns and data flow than the "mainline".  It has a parallel in things
> like CRC checks in communications protocols.  A lot of hardware has a
> dedicated little CRC checker that is continuously calculating the CRC as
> the bits arrive, so that when you get to the end of the frame, the answer
> is already there.
>
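A software analogue of that running CRC checker, as a sketch only: it uses Python's `zlib.crc32`, which accepts a running value as its second argument. The `RunningCrc` class name is made up for illustration, and real hardware would do this bit-serially rather than in chunks.

```python
import zlib


class RunningCrc:
    """Accumulate a CRC-32 as data arrives, so it is ready at end of frame."""

    def __init__(self) -> None:
        self.crc = 0

    def feed(self, chunk: bytes) -> None:
        # zlib.crc32 continues from the running value passed as 2nd argument.
        self.crc = zlib.crc32(chunk, self.crc)

    def value(self) -> int:
        return self.crc & 0xFFFFFFFF


# Usage: feed the frame in pieces, as it would arrive on the wire.
checker = RunningCrc()
for piece in (b"1234", b"567", b"89"):
    checker.feed(piece)
frame_crc = checker.value()  # same as zlib.crc32(b"123456789")
```

The point is that the check is computed alongside the data flow instead of as a separate pass at the end.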
>
> And Doug, your small systems have a lot of the same issues, perhaps
> because that small Limulus might be operated in environments other than
> what the underlying hardware was designed for.  I know people who have
> been rudely surprised when they found that the design environment for a
> laptop is a pretty narrow temperature range (e.g. office desktop) and when
> they put them in a car, subject to 0C or 40C temperatures, if not wider,
> that things don't work quite as well as expected.

I will be curious to see where these things show up since
all you really need is a power plug. (a little nervous actually).

>
> Very small systems (few nodes) have the same issues in some environments
> (e.g. a cluster subject to single event upsets or functional interrupts in
> a high radiation environment with a lot of high energy charged particles;
> it's not so much a total dose thing, but a SEE thing).
>
> For Juno (which is in polar orbit around Jupiter), we shielded everything
> in a vault (a 1 meter cube with 1cm thick titanium walls) and still it's
> an issue.  We don't get very long before everything is cooked.
>
> And I think that with a non-trivially small cluster (e.g. more than 4
> nodes) you could do a lot of experimentation on techniques.

I agree. Four nodes is really small. BTW, the most fun in designing
this system is a set of tighter constraints than are found on the typical
cluster. Noise, power, space, cabling, low cost packaging, etc. I have
been asked about a rack mount version, we'll see.

One thing I find interesting is the core/node efficiency
(what I call "effective cores"). In general, *on some codes*, I found that
fewer cores (1P micro-ATX, 4 cores) are more efficient than many
cores (2P server, 12 cores). Seems obvious, but I like to test things.

>
>
> (oddly, simulated fault injection is one of the trickier parts)
>

I would assume, because in a sense, the black swan* is
by definition hard to predict.

(* the book by Nassim Nicholas Taleb, not the movie)

--
Doug
