[Beowulf] ECC

Sun Nov 4 10:11:35 PST 2012

On 04.11.2012 at 19:06, Jörg Saßmannshausen wrote:

> Hi all,
> 
> I agree with Vincent regarding ECC, I think it is really mandatory for a 
> cluster which does number crunching.
> 
> However, the best cluster does not help if the deployed code does not have a 
> test suite to verify the installation.

...and any update/patch. Once you upgrade the kernel and/or libraries the test suite has to be run again.
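
A minimal sketch of what such a re-run could check, assuming the code's key results can be reduced to a few named numbers. The quantity names, reference values and tolerance below are made up for illustration and do not come from any particular chemistry package:

```python
import math

def compare_to_reference(results, reference, rel_tol=1e-9):
    """Return the names of quantities that are missing or differ from the reference."""
    failures = []
    for name, expected in reference.items():
        actual = results.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures

# Hypothetical reference values recorded on a known-good installation.
reference = {"total_energy": -76.026765, "dipole": 0.8213}
# Hypothetical values produced after a kernel or library upgrade.
after_upgrade = {"total_energy": -76.026765, "dipole": 0.8240}

failed = compare_to_reference(after_upgrade, reference)
print(failed)  # names of the quantities that no longer match
```

The tolerance is a judgment call: too tight and harmless changes in floating point rounding trigger failures, too loose and real regressions slip through.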

-- Reuti

> Believe me, that is not an exception, I 
> know a number of chemistry codes which are used in practice and there is no 
> test suite, or the test suite is broken and it actually says on the code's 
> webpage: don't bother using the test suite, it is broken and we know it.
> 
> So you need both: good hardware _and_ good software with a test suite to 
> generate meaningful results. If one of the requirements is not met, we might 
> as well throw dice, which is cheaper ;-)
> 
> All the best from a wet London
> 
> Jörg
> 
> 
> On Sunday 04 November 2012 Vincent Diepeveen wrote:
>> On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
>>> On 11/3/12 6:55 PM, "Robin Whittle" <rw at firstpr.com.au> wrote:
>>>> <snip>
>> 
>> [snip]
>> 
>>>> For serious work, the cluster and its software need to survive power
>>>> outages, failure of individual servers and memory errors, so ECC memory
>>>> is a good investment . . . which typically requires more expensive
>>>> motherboards and CPUs.
>>> 
>>> Actually, I don't know that I would agree with you about ECC, etc.  ECC
>>> memory is an attempt to create "perfect memory".  As you scale up, the
>>> assumption of "perfect computation" becomes less realistic, so that means
>>> your application (or the infrastructure on which the application sits) has
>>> to explicitly address failures, because at sufficiently large scale, they
>>> are inevitable.  Once you've dealt with that, then whether ECC is needed
>>> or not (or better power supplies, or cooling fans, or lunar gravity phase
>>> compensation, or whatever) is part of your computational design and
>>> budget:  it might be cheaper (using whatever metric) to overprovision and
>>> allow errors than to buy fewer better widgets.
>> 
>> I don't know whether 'outages' are a big issue for all clusters - here
>> in Western Europe we hardly have power failures, so I could understand
>> it if a company with a cluster doesn't invest in battery packs, as the
>> company won't be able to run anyway if there isn't power.
>> 
>> More interesting is the ECC discussion.
>> 
>> ECC is simply a requirement IMHO, not a 'luxury thing' as some
>> hardware engineers see it.
>> 
>> I know some memory engineers disagree here - for example one of them
>> mentioned to me that "putting ECC onto a GPU is nonsense as it is a lot
>> of effort and GDDR5 already has a built in CRC", or something like that
>> (if I remember the quote correctly).
>> 
>> But they do not administer servers themselves.
>> 
>> They also don't understand the accuracy, or rather the LACK of accuracy,
>> with which some of those who compute on big iron check their
>> calculations. If you compute on a cluster and get a result after some
>> months, the reality is simply that 99% of researchers are not in the
>> Einstein league, and 90% simply suck too much by any standard, in the
>> sense that they wouldn't notice an obvious problem generated by a
>> bitflip here or there. They would just happily invent a new theory, as
>> we have already seen too often in history.
>> 
>> By simply putting in ECC you avoid this 'interpreting the results
>> correctly' problem in some percentage of the cases.
>> 
>> Furthermore there are too many calculations where a single bitflip
>> could be catastrophic, and calculating for a few months on hundreds of
>> cores without ECC is then asking for trouble.
>> 
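
To put a rough number on "asking for trouble", here is a back-of-the-envelope sketch. It assumes memory errors arrive independently at a constant rate (a Poisson model), and the rate used is purely an illustrative assumption, not a measured figure for any real hardware:

```python
import math

def prob_at_least_one_error(errors_per_node_hour, nodes, hours):
    """P(at least one error), assuming independent errors at a constant rate."""
    expected_errors = errors_per_node_hour * nodes * hours
    return 1.0 - math.exp(-expected_errors)

# ASSUMPTION: one uncorrected bit flip per node every 10 years (illustrative only).
rate = 1.0 / (10 * 365 * 24)

# 100 nodes running for about three months (90 days).
p = prob_at_least_one_error(rate, nodes=100, hours=90 * 24)
print(f"{p:.3f}")
```

Even with this optimistic per-node rate, the chance that a three-month run on 100 nodes sees at least one flip comes out above 90%. Whether that flip actually corrupts a result is a separate question.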
>> As a last argument I want to note that in many sciences we simply see
>> that the post-World War II standard of using alpha = 0.05, or an error
>> of at most 5% (2 x standard deviation), simply isn't accurate enough
>> anymore for today's generation of scientists.
>> 
>> They need more accuracy.
>> 
>> So, historic debates on what is or isn't enough aside, reducing
>> errors by means of using ECC is really important.
>> 
>> Now that said - if someone shows up with a different form of checking
>> that's just as accurate or even better - that would be acceptable as
>> well - yet most discussions with the hardware engineers typically go
>> like: "why go to all this effort to get rid of a few errors, when if my
>> Windows laptop crashes I just reboot it".
>> 
>> Such discussions really should be discussions of the past -
>> society is moving on - one needs far higher accuracy and
>> reliability now - simply because the CPUs do more calculations and the
>> memory therefore has to serve more bytes per second.
>> 
>> All in all, ECC is a requirement for huge clusters and, from my
>> viewpoint, also for relatively tiny clusters.
>> 
>>>> I understand that the most serious limitation of this approach is the
>>>> bandwidth and latency (how long it takes for a message to get to the
>>>> destination server) of 1Gbps Ethernet.  The most obvious alternatives
>>>> are using multiple 1Gbps Ethernet connections per server (but this is
>>>> complex and only marginally improves bandwidth, while doing little or
>>>> nothing for latency) or upgrading to Infiniband.  As far as I know,
>>>> Infiniband is exotic and expensive compared to the mass market
>>>> motherboards etc. from which a Beowulf cluster can be made.  In other
>>>> words, I think Infiniband is required to make a cluster work really
>>>> well, but it does not (yet) meet the original Beowulf goal of being
>>>> inexpensive and commonly available.
>>> 
>>> Perhaps a distinction should be made between "original Beowulf" and
>>> "cluster computer"?  As you say, the original idea (espoused in the book,
>>> etc.) is a cluster built from cheap commodity parts. That would mean
>>> "commodity packaging", "commodity interconnects", etc.  which for the most
>>> part meant tower cases and ethernet.  However, cheap custom sheet metal is
>>> now available (back when Beowulfs were first being built, rooms full of
>>> servers were still a fairly new and novel thing, and you paid a
>>> significant premium for rack mount chassis, especially as consumer
>>> pressure forced the traditional tower case prices down).
>>> 
>>>> I think this model of HPC cluster computing remains fundamentally true,
>>>> but there are two important developments in recent years which either
>>>> alter the way a cluster would be built or used or which may make the
>>>> best solution to a computing problem no longer a cluster.  These
>>>> developments are large numbers of CPU cores per server, and the use of
>>>> GPUs to do massive amounts of computing, in a single inexpensive graphics
>>>> card - more crunching than was possible in massive clusters a decade
>>>> earlier.
>>> 
>>> Yes.  But in some ways, utilizing them has the same sort of software
>>> problem as using multiple nodes in the first place (EP aside).  And the
>>> architecture of the interconnects is heterogeneous compared to the fairly
>>> uniform interconnect of a generalized cluster fabric.  One can raise the
>>> same issues with cache, by the way.
>>> 
>>>> The ideal computing system would have a single CPU core which could run
>>>> at arbitrarily high frequencies, with low latency, high bandwidth,
>>>> access to an arbitrarily large amount of RAM, with matching links to
>>>> hard disks or other non-volatile storage systems, with a good Ethernet
>>>> link to the rest of the world.
>>>> 
>>>> While CPU clock frequencies and computing effort per clock cycle
>>>> have been growing slowly for the last 10 years or so, there has been a
>>>> continuing increase in the number of CPU cores per CPU device (typically
>>>> a single chip, but sometimes multiple chips in a device which is plugged
>>>> into the motherboard) and in the number of CPU devices which can be
>>>> plugged into a motherboard.
>>> 
>>> That's because CPU clock is limited by physics.  "work per clock cycle" is
>>> also limited by physics to a certain extent (because today's processors
>>> are mostly synchronous, so you have a propagation delay time from one side
>>> of the processor to the other) except for things like array processors
>>> (SIMD) but I'd say that's just multiple processors that happen to be doing
>>> the same thing, rather than a single processor doing more.
>>> 
>>> The real force driving multiple cores is the incredible expense of getting
>>> on and off chip.  Moving a bit across the chip is easy, compared to off
>>> chip:  you have to change the voltage levels, have enough current to drive
>>> a trace, propagate down that trace, receive the signal at the other end,
>>> shift voltages again.
>>> 
>>>> Most mass market motherboards are for a single CPU device, but there are
>>>> a few two and four CPU motherboards for Intel and AMD CPUs.
>>>> 
>>>> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores
>>>> per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon
>>>> equivalents are marginally faster than those with 6 or 8 cores.
>>>> 
>>>> In all cases, as far as I know, combining multiple CPU cores and/or
>>>> multiple CPU devices results in a single computer system, with a single
>>>> operating system and a single body of memory, with multiple CPU cores
>>>> all running around in this shared memory.
>>> 
>>> Yes.. That's a fairly simple model and easy to program for.
>>> 
>>>> I have no clear idea how each CPU core knows what the other cores have
>>>> written to the RAM they are using, since each core is reading and
>>>> writing via its own cache of the memory contents.  This raises the
>>>> question of inter-CPU-core communications, within a single CPU chip,
>>>> between chips in a multi-chip CPU module, and between multiple CPU
>>>> modules on the one motherboard.
>>> 
>>> Generally handled by the OS kernel.  In a multitasking OS, the scheduler
>>> just assigns the next free CPU to the next task.  Whether you restore the
>>> context from processor A to processor A or to processor B doesn't make
>>> much difference.  Obviously, there are cache issues (since that's part of
>>> context). This kind of thing is why multiprocessor kernels are
>>> non-trivial.
>>> 
>>>> I understand that MPI works identically from the programmer's
>>>> perspective between CPU-cores on a shared memory computer as between
>>>> CPU-cores on separate servers.  However, the performance (low latency
>>>> and high bandwidth) of these communications within a single shared
>>>> memory system is vastly higher than between any separate servers, which
>>>> would rely on Infiniband or Ethernet.
>>> 
>>> Yes.  This is a problem with a simple interconnect model.. It doesn't
>>> necessarily reflect that the cost of the interconnect differs depending on
>>> how far and how fast you're going.  That said, there is a fair amount of
>>> research into this.  Hypercube processors had limited interconnects
>>> between nodes (only nearest neighbor) and there are toroidal fabrics (2D
>>> interconnects) as well.
>>> 
>>>> So even if you have, or are going to write, MPI-based software which can
>>>> run on a cluster, there may be an argument for not building a cluster as
>>>> such, but for building a single motherboard system with as many as 64
>>>> CPU cores.
>>> 
>>> Sure.. If your problem is of a size that it can be solved by a single box,
>>> then that's usually the way to go.  (It applies in areas outside of
>>> computing.. Better to have one big transmitter tube than lots of little
>>> ones). But it doesn't scale.  The instant the problem gets too big, then
>>> you're stuck.  The advantage of clusters is that they are scalable.  Your
>>> problem gets 2x bigger, in theory, you add another N nodes and you're
>>> ready to go (Amdahl's law can bite you though).
>>> 
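
For reference, Amdahl's law for a fixed problem size can be sketched in a few lines. The 95% parallel fraction is an arbitrary assumption for illustration, not a number from this thread:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# ASSUMPTION: 95% of the run time parallelises perfectly, 5% stays serial.
for n in (1, 8, 64, 512):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With a 5% serial part the speedup can never exceed 20x, however many nodes are added. That is the "bite": adding N nodes helps only as long as the serial fraction stays small relative to 1/N.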
>>> There's even been a lot of discussion over the years on this list about
>>> the optimum size cluster to build for a big task, given that computers are
>>> getting cheaper/more powerful.  If you've got 2 years worth of computing,
>>> do you buy a computer today that can finish the job in 2 years, or do you
>>> do nothing for a year and buy a computer that is twice as fast in a year?
>>> 
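
The buy-now-or-wait question is easy to play with numerically. This is only a sketch under the assumption that performance per unit of money doubles every year, as in the example above; the function name and the doubling period are illustrative:

```python
def finish_time_years(work_years_today, wait_years, doubling_period_years=1.0):
    """Total time to finish if you wait first, then buy a correspondingly faster machine."""
    speedup = 2.0 ** (wait_years / doubling_period_years)
    return wait_years + work_years_today / speedup

# A job that would take 2 years on a machine bought today.
for wait in (0.0, 0.5, 1.0, 1.5):
    print(wait, round(finish_time_years(2.0, wait), 2))
```

Under that doubling assumption, buying today and waiting a full year both finish after 2 years, and a shorter wait comes out slightly ahead. With slower hardware improvement the balance shifts towards buying now.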
>>>> I think the major new big academic cluster projects focus on getting as
>>>> many CPU cores as possible into a single server, while minimising power
>>>> consumption per unit of compute power, and then hooking as many as
>>>> possible of these servers together with Infiniband.
>>> 
>>> That might be an aspect of trying to make a general purpose computing
>>> resource within a specified budget.
>>> 
>>>> Here is a somewhat rambling discussion of my own thoughts regarding
>>>> clusters and multi-core machines, for my own purposes.  My interests in
>>>> high performance computing involve music synthesis and physics
>>>> simulation.
>>>> 
>>>> There is an existing, single-threaded (written in C, can't be made
>>>> multithreaded in any reasonable manner) music synthesis program called
>>>> Csound.  I want to use this now, but as a language for synthesis, I
>>>> think it is extremely clunky.  So I plan to write my own program - one
>>>> day . . .   When I do, it will be written in C++ and multithreaded, so
>>>> it will run nicely on multiple CPU-cores in a single machine.  Writing
>>>> and debugging a multithreaded program is more complex than doing so for
>>>> a single-threaded program, but I think it will be practical and a lot
>>>> easier than writing and debugging an MPI based program running either on
>>>> multiple servers or on multiple CPU-cores on a single server.
>>> 
>>> Maybe, maybe not.  How is your interthread communication architecture
>>> structured?  Once you bite the bullet and go with a message passing model,
>>> it's a lot more scalable, because you're not doing stuff like "shared
>>> memory".
>>> 
>>>> I want to do some simulation of electromagnetic wave propagation using
>>>> an existing and widely used MPI-based (C++, open source) program called
>>>> Meep.  This can run as a single thread, if there is enough RAM, or the
>>>> problem can be split up to run over multiple threads using MPI
>>>> communication between the threads.  If this is done on a single server,
>>>> then the MPI communication is done really quickly, via shared memory,
>>>> which is vastly faster than using Ethernet or Infiniband to other
>>>> servers.  However, this places a limit on the number of CPU-cores and
>>>> the total memory.  When simulating three dimensional models, the RAM and
>>>> CPU requirements can easily become extreme.  Meep was written to
>>>> split the problem into multiple zones, and to work efficiently with MPI.
>>> 
>>> As you note, this is the advantage of setting up a message passing
>>> architecture from the beginning.. It works regardless of the scale/method
>>> of message passing.  There *are* differences in performance.
>>> 
>>>> Ten or 15 years ago, the only way to get more compute power was to build
>>>> a cluster and therefore to write the software to use MPI.  This was
>>>> because CPU-devices had a single core (Intel Pentium 3 and 4) and
>>>> because it was rare to find motherboards which handled multiple such
>>>> chips.
>>> 
>>> Yes
>>> 
>>>> The next step would be to get a 4 socket motherboard from Tyan or
>>>> SuperMicro for $800 or so and populate it with 8, 12 or (if money
>>>> permits) 16 core CPUs and a bunch of ECC RAM.
>>>> 
>>>> My forthcoming music synthesis program would run fine with 8 or 16GB of
>>>> RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
>>>> machines would do the trick nicely.
>>> 
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>> 
> 
> 
> -- 
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ 
> 
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
> 
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html