[Beowulf] ECC

Sun Nov 4 09:46:17 PST 2012

On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:

>
>
> On 11/3/12 6:55 PM, "Robin Whittle" <rw at firstpr.com.au> wrote:
>> <snip>

[snip]

>>
>> For serious work, the cluster and its software need to survive power
>> outages, failure of individual servers and memory errors, so ECC memory
>> is a good investment . . . which typically requires more expensive
>> motherboards and CPUs.
>
>
> Actually, I don't know that I would agree with you about ECC, etc.  ECC
> memory is an attempt to create "perfect memory".  As you scale up, the
> assumption of "perfect computation" becomes less realistic, so that means
> your application (or the infrastructure on which the application sits) has
> to explicitly address failures, because at sufficiently large scale, they
> are inevitable.  Once you've dealt with that, then whether ECC is needed
> or not (or better power supplies, or cooling fans, or lunar gravity phase
> compensation, or whatever) is part of your computational design and
> budget:  it might be cheaper (using whatever metric) to overprovision and
> allow errors than to buy fewer better widgets.
>

I don't know whether outages are a big issue for all clusters. Here in
Western Europe we hardly ever have power failures, so I can imagine a
company with a cluster not investing in battery packs, as the company
won't be able to run anyway if there isn't power.

More interesting is the ECC discussion.

ECC is simply a requirement IMHO, not a 'luxury thing' as some hardware
engineers see it.

I know some memory engineers disagree here. For example, one of them
mentioned to me that "putting ECC onto a GPU is nonsense as it is a lot
of effort and GDDR5 already has a built-in CRC", or something like that
(if I remember the quote correctly).

But they do not administer servers themselves.

They also don't understand the accuracy, or rather the LACK of accuracy,
with which results calculated on big iron get checked. If you calculate
on a cluster and get a result after some months, the reality is simply
that 99% of researchers aren't in the Einstein league, and 90% are poor
enough, by any standard, that they wouldn't notice an obvious problem
generated by a bitflip here or there. They would just happily invent a
new theory, as we have already seen too often in history.

Simply by putting in ECC you avoid this 'interpreting the results
correctly' problem in some percentage of the cases.

Furthermore, there are too many calculations where a single bitflip
could be catastrophic, and calculating for a few months on hundreds of
cores is then asking for trouble
without ECC.
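
To make the "single bitflip" point concrete, here is a minimal Python
sketch. It is only an illustration: the helper name flip_bit is made up
for this example, and it assumes the value is stored as an IEEE 754
64-bit double. It flips one chosen bit of a number and returns what the
program would then see.

```python
import struct


def flip_bit(x, bit):
    """Return the double obtained by flipping one bit (0..63) of x.

    Hypothetical helper for illustration; bit 0 is the lowest mantissa
    bit, bits 52..62 are the exponent, bit 63 is the sign.
    """
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Flip the requested bit and reinterpret the result as a double again.
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


if __name__ == "__main__":
    for bit in (0, 51, 62, 63):
        print(bit, flip_bit(1.0, bit))
```

Flipping the lowest mantissa bit of 1.0 changes it by 2**-52, which
nobody would notice. Flipping bit 63 turns it into -1.0, and flipping
bit 62 turns it into infinity. Which of these you get depends purely on
where the flip lands.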

As a last argument, I want to note that in many sciences the post-World
War II standard of using alpha = 0.05, or an error of at most 5%
(2 x standard deviation), simply isn't
accurate enough anymore for today's generation of scientists.
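
For reference, the link between "k standard deviations" and a two-sided
error probability can be checked in a few lines of Python. This is a
sketch that assumes normally distributed errors; the function name is
chosen only for this example.

```python
import math


def two_sided_tail(k):
    """Probability that a normal variable lies more than k standard
    deviations from its mean, in either direction."""
    return math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (1.96, 2.0, 3.0, 5.0):
        print(f"{k} sigma -> alpha = {two_sided_tail(k):.3g}")
```

Under that assumption, 2 standard deviations corresponds to roughly
4.6%, and exactly 5% is reached at about 1.96 standard deviations.
Asking for 5 standard deviations pushes the error probability below one
in a million.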

They need more accuracy.

So whatever the historic debates on what is or isn't enough, reducing
errors by means of ECC is really important.

Now that said, if someone shows up with a different form of checking
that's just as accurate or even better, that would be acceptable as
well. Yet most discussions with hardware engineers typically go like:
"why go to all this effort to get rid of a few errors, when if my
Windows laptop crashes I just
reboot it".
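
As an example of what a "different form of checking" in software could
look like, here is a minimal sketch of redundant computation with
majority voting. It is not a claim about what any particular cluster
does. The function name vote is invented for this example, and it
assumes the computation is deterministic and returns a hashable result.

```python
from collections import Counter


def vote(compute, runs=3):
    """Run compute() several times and return the majority result.

    Raises RuntimeError if no result is produced by more than half of
    the runs, which means the error could not be outvoted.
    """
    results = [compute() for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count <= runs // 2:
        raise RuntimeError("no majority among redundant runs")
    return value


if __name__ == "__main__":
    # Simulate one corrupted run out of three.
    outcomes = iter([42, 43, 42])
    print(vote(lambda: next(outcomes)))
```

Note the price: this triples the compute time to outvote a transient
error, and it does nothing against corrupted input data that all three
runs read. That cost is what has to be weighed against ECC.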

Such discussions really should be discussions of the past. Society is
moving on; one needs far higher accuracy and reliability now, simply
because the CPUs do more calculations and the memory therefore has to
serve more bytes per second.

All in all, ECC is a requirement for huge clusters and, from my
viewpoint, also for relatively tiny clusters.

>
>
>
>>
>> I understand that the most serious limitation of this approach is the
>> bandwidth and latency (how long it takes for a message to get to the
>> destination server) of 1Gbps Ethernet.  The most obvious alternatives
>> are using multiple 1Gbps Ethernet connections per server (but this is
>> complex and only marginally improves bandwidth, while doing little or
>> nothing for latency) or upgrading to Infiniband.  As far as I know,
>> Infiniband is exotic and expensive compared to the mass market
>> motherboards etc. from which a Beowulf cluster can be made.  In other
>> words, I think Infiniband is required to make a cluster work really
>> well, but it does not (yet) meet the original Beowulf goal of being
>> inexpensive and commonly available.
>
> Perhaps a distinction should be made between "original Beowulf" and
> "cluster computer"?  As you say, the original idea (espoused in the book,
> etc.) is a cluster built from cheap commodity parts. That would mean
> "commodity packaging", "commodity interconnects", etc.  which for the most
> part meant tower cases and ethernet.  However, cheap custom sheet metal is
> now available (back when Beowulfs were first being built, rooms full of
> servers were still a fairly new and novel thing, and you paid a
> significant premium for rack mount chassis, especially as consumer
> pressure forced the traditional tower case prices down)
>
>
>
>
>
>
>>
>>
>> I think this model of HPC cluster computing remains fundamentally true,
>> but there are two important developments in recent years which either
>> alter the way a cluster would be built or used or which may make the
>> best solution to a computing problem no longer a cluster.  These
>> developments are large numbers of CPU cores per server, and the use of
>> GPUs to do massive amounts of computing, in a single inexpensive graphics
>> card - more crunching than was possible in massive clusters a decade
>> earlier.
>
> Yes.  But in some ways, utilizing them has the same sort of software
> problem as using multiple nodes in the first place (EP aside).  And the
> architecture of the interconnects is heterogeneous compared to the fairly
> uniform interconnect of a generalized cluster fabric.  One can raise the
> same issues with cache, by the way.
>
>
>
>>
>> The ideal computing system would have a single CPU core which could run
>> at arbitrarily high frequencies, with low latency, high bandwidth,
>> access to an arbitrarily large amount of RAM, with matching links to
>> hard disks or other non-volatile storage systems, with a good Ethernet
>> link to the rest of the world.
>>
>> While CPU clock frequencies and computing effort per clock frequency
>> have been growing slowly for the last 10 years or so, there has been a
>> continuing increase in the number of CPU cores per CPU device (typically
>> a single chip, but sometimes multiple chips in a device which is plugged
>> into the motherboard) and in the number of CPU devices which can be
>> plugged into a motherboard.
>
>
> That's because CPU clock is limited by physics.  "work per clock cycle" is
> also limited by physics to a certain extent (because today's processors
> are mostly synchronous, so you have a propagation delay time from one side
> of the processor to the other) except for things like array processors
> (SIMD) but I'd say that's just multiple processors that happen to be doing
> the same thing, rather than a single processor doing more.
>
> The real force driving multiple cores is the incredible expense of getting
> on and off chip.  Moving a bit across the chip is easy, compared to off
> chip:  you have to change the voltage levels, have enough current to drive
> a trace, propagate down that trace, receive the signal at the other end,
> shift voltages again.
>
>
>
>>
>> Most mass market motherboards are for a single CPU device, but there are
>> a few two and four CPU motherboards for Intel and AMD CPUs.
>>
>> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores
>> per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon
>> equivalents are marginally faster than those with 6 or 8 cores.
>>
>> In all cases, as far as I know, combining multiple CPU cores and/or
>> multiple CPU devices results in a single computer system, with a single
>> operating system and a single body of memory, with multiple CPU cores
>> all running around in this shared memory.
>
> Yes.. That's a fairly simple model and easy to program for.
>
>>  I have no clear idea how each
>> CPU core knows what the other cores have written to the RAM they are
>> using, since each core is reading and writing via its own cache of the
>> memory contents.  This raises the question of inter-CPU-core
>> communications, within a single CPU chip, between chips in a multi-chip
>> CPU module, and between multiple CPU modules on the one motherboard.
>
>
> Generally handled by the OS kernel.  In a multitasking OS, the scheduler
> just assigns the next free CPU to the next task.  Whether you restore the
> context from processor A to processor A or to processor B doesn't make
> much difference.  Obviously, there are cache issues (since that's part of
> context). This kind of thing is why multiprocessor kernels are non-trivial.
>
>> I understand that MPI works identically from the programmer's
>> perspective between CPU-cores on a shared memory computer as between
>> CPU-cores on separate servers.  However, the performance (low latency
>> and high bandwidth) of these communications within a single shared
>> memory system is vastly higher than between any separate servers, which
>> would rely on Infiniband or Ethernet.
>
>
> Yes.  This is a problem with a simple interconnect model.. It doesn't
> necessarily reflect that the cost of the interconnect is different
> depending on how far and how fast you're going.  That said, there is a
> fair amount of research into this.  Hypercube processors had limited
> interconnects between nodes (only nearest neighbor) and there are
> toroidal fabrics (2D interconnects) as well.
>>
>> So even if you have, or are going to write, MPI-based software which can
>> run on a cluster, there may be an argument for not building a cluster as
>> such, but for building a single motherboard system with as many as 64
>> CPU cores.
>
>
> Sure.. If your problem is of a size that it can be solved by a single box,
> then that's usually the way to go.  (It applies in areas outside of
> computing.. Better to have one big transmitter tube than lots of little
> ones). But it doesn't scale.  The instant the problem gets too big, then
> you're stuck.  The advantage of clusters is that they are scalable.  Your
> problem gets 2x bigger, in theory, you add another N nodes and you're
> ready to go (Amdahl's law can bite you though).
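
For readers who have not met it, Amdahl's law can be written down in a
few lines. This is a sketch only: the function name and the example
parallel fraction of 0.9 are chosen for illustration and are not
measurements of any real code.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup on `nodes` processors when only `parallel_fraction` of
    the single-processor runtime can be parallelised (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # Illustrative fraction: 90% of the work parallelises perfectly.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallel, the speedup can never exceed 10 however
many nodes are added, because the serial 10% still has to run on one
core. That is the sense in which adding another N nodes stops helping.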
>
> There's even been a lot of discussion over the years on this list about
> the optimum size cluster to build for a big task, given that computers are
> getting cheaper/more powerful.  If you've got 2 years worth of computing,
> do you buy a computer today that can finish the job in 2 years, or do you
> do nothing for a year and buy a computer that is twice as fast in a
> year?
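
The buy-now-or-wait question above can be put into a small sketch. The
model is an assumption made for this illustration, not something from
the list discussion: hardware speed for the same money doubles smoothly
every doubling_years, and the job is started on whatever is bought
after waiting. The function name is also made up.

```python
def finish_time(work_years, wait_years, doubling_years=1.0):
    """Total years until the job is done if you wait `wait_years` and
    then buy hardware that is 2**(wait_years/doubling_years) times
    faster than today's. `work_years` is the runtime on today's
    hardware. Assumes smooth exponential improvement."""
    speedup = 2.0 ** (wait_years / doubling_years)
    return wait_years + work_years / speedup


if __name__ == "__main__":
    for wait in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(wait, round(finish_time(2.0, wait), 3))
```

With the numbers quoted (2 years of work, twice as fast after a year),
buying today and waiting a full year both finish after 2 years. Under
the smooth-doubling assumption, waiting somewhat less than a year
finishes slightly earlier than either.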
>
>>
>> I think the major new big academic cluster projects focus on getting as
>> many CPU cores as possible into a single server, while minimising power
>> consumption per unit of compute power, and then hooking as many as
>> possible of these servers together with Infiniband.
>
> That might be an aspect of trying to make a general purpose computing
> resource within a specified budget.
>
>
>>
>> Here is a somewhat rambling discussion of my own thoughts regarding
>> clusters and multi-core machines, for my own purposes.  My interests in
>> high performance computing involve music synthesis and physics simulation.
>>
>> There is an existing, single-threaded (written in C, can't be made
>> multithreaded in any reasonable manner) music synthesis program called
>> Csound.  I want to use this now, but as a language for synthesis, I
>> think it is extremely clunky.  So I plan to write my own program - one
>> day . . .   When I do, it will be written in C++ and multithreaded, so
>> it will run nicely on multiple CPU-cores in a single machine.  Writing
>> and debugging a multithreaded program is more complex than doing so for
>> a single-threaded program, but I think it will be practical and a lot
>> easier than writing and debugging an MPI based program running either on
>> multiple servers or on multiple CPU-cores on a single server.
>
> Maybe, maybe not.  How is your interthread communication architecture
> structured?  Once you bite the bullet and go with a message passing model,
> it's a lot more scalable, because you're not doing stuff like "shared
> memory".
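
To show what a message passing structure between threads can look like,
here is a minimal Python sketch using only the standard library. The
worker task (squaring numbers) and the function names are placeholders
invented for this example. The point is that the threads share no
variables and talk only through queues, so the same structure could
later be mapped onto MPI ranks.

```python
import queue
import threading


def worker(inbox, outbox):
    """Square every number received; a None message means 'stop'."""
    while True:
        msg = inbox.get()
        if msg is None:
            outbox.put(None)  # pass the stop marker downstream
            break
        outbox.put(msg * msg)


def run_pipeline(values):
    """Send values to one worker thread and collect its replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)
    results = []
    while True:
        reply = outbox.get()
        if reply is None:
            break
        results.append(reply)
    thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Because the worker only sees messages, no locks around shared data are
needed, and replacing the queues with network sends and receives does
not change the logic of either side.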
>
>
>
>>
>> I want to do some simulation of electromagnetic wave propagation using
>> an existing and widely used MPI-based (C++, open source) program called
>> Meep.  This can run as a single thread, if there is enough RAM, or the
>> problem can be split up to run over multiple threads using MPI
>> communication between the threads.  If this is done on a single server,
>> then the MPI communication is done really quickly, via shared memory,
>> which is vastly faster than using Ethernet or Infiniband to other
>> servers.  However, this places a limit on the number of CPU-cores and
>> the total memory.  When simulating three dimensional models, the RAM and
>> CPU demands can easily become extremely demanding.  Meep was written to
>> split the problem into multiple zones, and to work efficiently with MPI.
>
> As you note, this is the advantage of setting up a message passing
> architecture from the beginning.. It works regardless of the scale/method
> of message passing.  There *are* differences in performance.
>
>>
>> Ten or 15 years ago, the only way to get more compute power was to build
>> a cluster and therefore to write the software to use MPI.  This was
>> because CPU-devices had a single core (Intel Pentium 3 and 4) and
>> because it was rare to find motherboards which handled multiple such
>> chips.
>
> Yes
>
>>
>> The next step would be to get a 4 socket motherboard from Tyan or
>> SuperMicro for $800 or so and populate it with 8, 12 or (if money
>> permits) 16 core CPUs and a bunch of ECC RAM.
>>
>> My forthcoming music synthesis program would run fine with 8 or 16GB of
>> RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
>> machines would do the trick nicely.
>
>>
>