[Beowulf] Mark Hahn's Beowulf/Cluster/HPC mini-FAQ for newbies & some further thoughts

Sun Nov 4 08:53:35 PST 2012

On 11/3/12 6:55 PM, "Robin Whittle" <rw at firstpr.com.au> wrote:
><snip>

>The Wikipedia articles:
>
>  http://en.wikipedia.org/wiki/High-performance_computing
>  http://en.wikipedia.org/wiki/High-throughput_computing
>
>are of general interest.
>
>Here are some thoughts which might complement what Mark wrote.  I am a
>complete newbie to this field and I hope more experienced people will
>correct or expand on the following.
>
>HPC has quite a long history and my understanding of the Beowulf concept
>is to make clusters using easily available hardware and software, such
>as operating systems, servers and interconnects.  The Interconnects, I
>think are invariably Ethernet, since anything else is, or at least was
>until recently, exotic and therefore expensive.

That was probably true up until about 5-10 years ago.  And it is still true
for "small" clusters.  I don't have any stats to back this up, but my gut
feel is that for clusters over 100 nodes (give or take) other
interconnects are more popular.

That said, what nodes typically have is two interfaces: one is the main
interconnect used for computation; the other is an Ethernet link for
administrative functions.  The admin network isn't really performance
constrained.  You use it for things like collecting temperature data, boot
support, stuff like that.

>
>Traditionally - going back a decade or more - I think the assumptions
>have been that a single server has one or maybe two CPU cores and that
>the only way to get a single system to do more work is to interconnect a
>large number of these servers with a good file system and fast
>inter-server communications so that suitably written software (such as
>something written to use MPI, so that the instances of the software on
>multiple servers and/or CPU cores can work on the one problem
>efficiently) can do the job well.  All these things - power supplies,
>motherboards, CPUs, RAM, Ethernet (one or two 1Gbps Ethernet ports on the
>motherboard) and Ethernet switches - are inexpensive and easy to throw
>together into a smallish cluster.

Yes.

>
>However, creating software to use it efficiently can be very tricky -
>unless of course the need can be satisfied by MPI-based software which
>someone else has already written.

yes

>
>For serious work, the cluster and its software need to survive power
>outages, failure of individual servers and memory errors, so ECC memory
>is a good investment . . . which typically requires more expensive
>motherboards and CPUs.

Actually, I don't know that I would agree with you about ECC, etc.  ECC
memory is an attempt to create "perfect memory".  As you scale up, the
assumption of "perfect computation" becomes less realistic, so that means
your application (or the infrastructure on which the application sits) has
to explicitly address failures, because at sufficiently large scale, they
are inevitable.  Once you've dealt with that, then whether ECC is needed
or not (or better power supplies, or cooling fans, or lunar gravity phase
compensation, or whatever) is part of your computational design and
budget:  it might be cheaper (using whatever metric) to overprovision and
allow errors than to buy fewer better widgets.
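
To put rough numbers on "at sufficiently large scale, they are inevitable", here is a minimal Python sketch.  It assumes node failures are independent and that every node has the same failure probability over the length of a run; both are simplifying assumptions, and the 0.1% figure in the loop is a made-up placeholder, not a measurement.

```python
def p_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails during a run.

    Assumes failures are independent and identically likely (p_node each).
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    return 1.0 - (1.0 - p_node) ** n_nodes


def expected_failures(p_node: float, n_nodes: int) -> float:
    """Expected number of failed nodes during a run (same assumptions)."""
    return p_node * n_nodes


if __name__ == "__main__":
    # 0.001 is a hypothetical per-node, per-run failure probability.
    for n in (10, 100, 1000, 10000):
        print(n, round(p_any_failure(0.001, n), 4), expected_failures(0.001, n))
```

The point is only the shape of the curve: a per-node risk that is negligible on ten nodes becomes close to a certainty on ten thousand, which is why the application has to handle failure no matter how good each widget is.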

>
>I understand that the most serious limitation of this approach is the
>bandwidth and latency (how long it takes for a message to get to the
>destination server) of 1Gbps Ethernet.  The most obvious alternatives
>are using multiple 1Gbps Ethernet connections per server (but this is
>complex and only marginally improves bandwidth, while doing little or
>nothing for latency) or upgrading to Infiniband.  As far as I know,
>Infiniband is exotic and expensive compared to the mass market
>motherboards etc. from which a Beowulf cluster can be made.  In other
>words, I think Infiniband is required to make a cluster work really
>well, but it does not (yet) meet the original Beowulf goal of being
>inexpensive and commonly available.

Perhaps a distinction should be made between "original Beowulf" and
"cluster computer"?  As you say, the original idea (espoused in the book,
etc.) is a cluster built from cheap commodity parts.  That would mean
"commodity packaging", "commodity interconnects", etc., which for the most
part meant tower cases and Ethernet.  However, cheap custom sheet metal is
now available.  (Back when Beowulfs were first being built, rooms full of
servers were still a fairly new and novel thing, and you paid a
significant premium for rack mount chassis, especially as consumer
pressure forced the traditional tower case prices down.)

>
>
>I think this model of HPC cluster computing remains fundamentally true,
>but there are two important developments in recent years which either
>alter the way a cluster would be built or used or which may make the
>best solution to a computing problem no longer a cluster.  These
>developments are large numbers of CPU cores per server, and the use of
>GPUs to do massive amounts of computing, in a single inexpensive graphics
>card - more crunching than was possible in massive clusters a decade
>earlier.

Yes.  But in some ways, utilizing them has the same sort of software
problem as using multiple nodes in the first place (embarrassingly
parallel work aside).  And the
architecture of the interconnects is heterogeneous compared to the fairly
uniform interconnect of a generalized cluster fabric.  One can raise the
same issues with cache, by the way.

>
>The ideal computing system would have a single CPU core which could run
>at arbitrarily high frequencies, with low latency, high bandwidth,
>access to an arbitrarily large amount of RAM, with matching links to
>hard disks or other non-volatile storage systems, with a good Ethernet
>link to the rest of the world.
>
>While CPU clock frequencies and the work done per clock cycle
>have been growing slowly for the last 10 years or so, there has been a
>continuing increase in the number of CPU cores per CPU device (typically
>a single chip, but sometimes multiple chips in a device which is plugged
>into the motherboard) and in the number of CPU devices which can be
>plugged into a motherboard.

That's because CPU clock is limited by physics.  "Work per clock cycle" is
also limited by physics to a certain extent (because today's processors
are mostly synchronous, so you have a propagation delay time from one side
of the processor to the other).  The exception is things like array
processors (SIMD), but I'd say that's just multiple processors that happen
to be doing the same thing, rather than a single processor doing more.
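
For a feel of the propagation-delay limit, here is a small Python sketch of how far a signal can travel in one clock period.  The speed of light in vacuum is the hard upper bound; the velocity_fraction parameter is a hypothetical knob for slower real wiring, and no particular value for it is claimed here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def distance_per_cycle_m(clock_hz: float, velocity_fraction: float = 1.0) -> float:
    """Distance a signal covers in one clock period, in metres.

    velocity_fraction is the signal speed as a fraction of c.  The default
    of 1.0 gives the vacuum upper bound; real on-chip signals are slower,
    and the right value depends on the technology (an assumption to supply).
    """
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    if not 0.0 < velocity_fraction <= 1.0:
        raise ValueError("velocity_fraction must be in (0, 1]")
    return SPEED_OF_LIGHT_M_PER_S * velocity_fraction / clock_hz


if __name__ == "__main__":
    for ghz in (1, 2, 3, 4):
        cm = distance_per_cycle_m(ghz * 1e9) * 100
        print(f"{ghz} GHz: at most {cm:.1f} cm per cycle")
```

Even the best case at a few GHz is on the order of centimetres per cycle, so a synchronous design cannot grow arbitrarily large without the clock having to slow down.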

The real force driving multiple cores is the incredible expense of getting
on and off chip.  Moving a bit across the chip is easy, compared to off
chip:  you have to change the voltage levels, have enough current to drive
a trace, propagate down that trace, receive the signal at the other end,
shift voltages again.

>
>Most mass market motherboards are for a single CPU device, but there are
>a few two and four CPU motherboards for Intel and AMD CPUs.
>
>It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores
>per CPU device.  I think the 4 core i7 CPUs or their ECC-compatible Xeon
>equivalents are marginally faster than those with 6 or 8 cores.
>
>In all cases, as far as I know, combining multiple CPU cores and/or
>multiple CPU devices results in a single computer system, with a single
>operating system and a single body of memory, with multiple CPU cores
>all running around in this shared memory.

Yes.  That's a fairly simple model and easy to program for.

>  I have no clear idea how each
>CPU core knows what the other cores have written to the RAM they are
>using, since each core is reading and writing via its own cache of the
>memory contents.  This raises the question of inter-CPU-core
>communications, within a single CPU chip, between chips in a multi-chip
>CPU module, and between multiple CPU modules on the one motherboard.

Generally handled by the OS kernel.  In a multitasking OS, the scheduler
just assigns the next free CPU to the next task.  Whether you restore the
context from processor A to processor A or to processor B doesn't make
much difference.  Obviously, there are cache issues (since that's part of
context). This kind of thing is why multiprocessor kernels are non-trivial.

>I understand that MPI works identically from the programmer's
>perspective between CPU-cores on a shared memory computer as between
>CPU-cores on separate servers.  However, the performance (low latency
>and high bandwidth) of these communications within a single shared
>memory system is vastly higher than between any separate servers, which
>would rely on Infiniband or Ethernet.

Yes.  This is a problem with a simple interconnect model.  It doesn't
necessarily reflect that the cost of the interconnect differs depending on
how far and how fast you're going.  That said, there is a fair amount of
research into this.  Hypercube processors had limited interconnects
between nodes (only nearest neighbor) and there are toroidal fabrics (2D
interconnects) as well.
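
To make "only nearest neighbor" concrete, here is a minimal Python sketch of neighbor lookup in the two topologies.  The node numbering (binary labels for the hypercube, (x, y) grid coordinates for the torus) is the usual textbook convention and is an assumption here, not something taken from a particular machine.

```python
def hypercube_neighbors(node: int, dims: int) -> list[int]:
    """Direct neighbors of a node in a dims-dimensional hypercube.

    Nodes are labelled 0 .. 2**dims - 1; two nodes are linked when their
    labels differ in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(dims)]


def hypercube_hops(a: int, b: int) -> int:
    """Minimum number of links between two hypercube nodes.

    This is the number of differing bits (the Hamming distance).
    """
    return bin(a ^ b).count("1")


def torus_neighbors(x: int, y: int, width: int, height: int) -> list[tuple[int, int]]:
    """The four direct neighbors of (x, y) in a 2D torus (edges wrap around)."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]
```

In both cases a message to a non-neighbor has to cross several links, so the cost of communication depends on where the two endpoints sit in the fabric.
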
>
>So even if you have, or are going to write, MPI-based software which can
>run on a cluster, there may be an argument for not building a cluster as
>such, but for building a single motherboard system with as many as 64
>CPU cores.

Sure.  If your problem is of a size that it can be solved by a single box,
then that's usually the way to go.  (It applies in areas outside of
computing, too: better to have one big transmitter tube than lots of little
ones.)  But it doesn't scale.  The instant the problem gets too big,
you're stuck.  The advantage of clusters is that they are scalable.  If your
problem gets 2x bigger, in theory you add another N nodes and you're
ready to go (Amdahl's law can bite you though).
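
For reference, Amdahl's law says that if a fraction p of the work can be parallelized over n workers, the speedup is 1 / ((1 - p) + p / n).  A minimal Python sketch follows; the 95% parallel fraction in the loop is an arbitrary illustrative value.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Speedup predicted by Amdahl's law.

    parallel_fraction is the share of the single-worker run time that can
    be spread across n_workers; the rest stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # 0.95 is a hypothetical parallel fraction chosen for illustration.
    for n in (1, 8, 64, 512, 4096):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 5% of the work serial, the speedup can never exceed 20 however many nodes are added, which is the "bite".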

There's even been a lot of discussion over the years on this list about
the optimum size cluster to build for a big task, given that computers are
getting cheaper/more powerful.  If you've got 2 years' worth of computing,
do you buy a computer today that can finish the job in 2 years, or do you
do nothing for a year and then buy a computer that is twice as fast?
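
The buy-now-or-wait question can be sketched as a one-line formula.  The Python below assumes the speed you can buy for a fixed budget doubles every doubling_years (one year, to match the example above) and that nothing else changes while you wait; both are simplifying assumptions, and the function names are made up for this sketch.

```python
def finish_time_years(work_years_today: float, wait_years: float,
                      doubling_years: float = 1.0) -> float:
    """Years from now until the job is done if you wait, then buy, then run.

    work_years_today is how long the job would take on a machine bought
    today.  A machine bought after wait_years is assumed to be
    2 ** (wait_years / doubling_years) times faster.
    """
    if work_years_today < 0 or wait_years < 0 or doubling_years <= 0:
        raise ValueError("times must be non-negative and doubling_years positive")
    speedup = 2.0 ** (wait_years / doubling_years)
    return wait_years + work_years_today / speedup


def best_wait_years(work_years_today: float, doubling_years: float = 1.0,
                    step: float = 0.01) -> float:
    """Wait time (scanned in `step` increments) that finishes the job soonest."""
    n_steps = int(work_years_today / step)
    candidates = [i * step for i in range(n_steps + 1)]
    return min(candidates,
               key=lambda w: finish_time_years(work_years_today, w, doubling_years))
```

Under these assumptions, buying today and waiting a full year both finish the 2-year job at the 2-year mark, and a shorter wait comes out slightly ahead.  The answer is sensitive to the assumed doubling time, so treat it as a way to frame the question rather than a recommendation.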

>
>I think the major new big academic cluster projects focus on getting as
>many CPU cores as possible into a single server, while minimising power
>consumption per unit of compute power, and then hooking as many as
>possible of these servers together with Infiniband.

That might be an aspect of trying to make a general purpose computing
resource within a specified budget.

>
>Here is a somewhat rambling discussion of my own thoughts regarding
>clusters and multi-core machines, for my own purposes.  My interests in
>high performance computing involve music synthesis and physics simulation.
>
>There is an existing, single-threaded (written in C, can't be made
>multithreaded in any reasonable manner) music synthesis program called
>Csound.  I want to use this now, but as a language for synthesis, I
>think it is extremely clunky.  So I plan to write my own program - one
>day . . .   When I do, it will be written in C++ and multithreaded, so
>it will run nicely on multiple CPU-cores in a single machine.  Writing
>and debugging a multithreaded program is more complex than doing so for
>a single-threaded program, but I think it will be practical and a lot
>easier than writing and debugging an MPI-based program running either
>on multiple servers or on multiple CPU-cores on a single server.

Maybe, maybe not.  How is your interthread communication architecture
structured?  Once you bite the bullet and go with a message passing model,
it's a lot more scalable, because you're not doing stuff like "shared
memory".
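
As an illustration of what a message passing structure between threads can look like, here is a minimal Python sketch (Python rather than C++ only to keep it short).  Each worker owns its data and talks to the rest of the program only through queues.  The worker function, the sum-of-squares "work" and the None sentinel are all hypothetical choices for this example.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive (chunk_id, samples) messages and reply with (chunk_id, result)."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work for this worker
            break
        chunk_id, samples = message
        # Placeholder computation standing in for real per-chunk work.
        outbox.put((chunk_id, sum(s * s for s in samples)))


def run(chunks: list, n_workers: int = 4) -> list:
    """Farm chunks out to workers and return the results in chunk order."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for chunk_id, samples in enumerate(chunks):
        inbox.put((chunk_id, samples))
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    results = {}
    for _ in chunks:  # exactly one reply arrives per chunk
        chunk_id, value = outbox.get()
        results[chunk_id] = value
    for t in threads:
        t.join()
    return [results[i] for i in range(len(chunks))]
```

Calling run([[1, 2], [3], []], n_workers=2) returns one result per chunk, in order.  Nothing in worker() touches shared state, so the same structure carries over to processes or to MPI ranks on other machines by swapping the queues for another transport.  (In CPython, threads will not speed up CPU-bound work like this; the sketch is about structure, not performance.)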

>
>I want to do some simulation of electromagnetic wave propagation using
>an existing and widely used MPI-based (C++, open source) program called
>Meep.  This can run as a single thread, if there is enough RAM, or the
>problem can be split up to run over multiple threads using MPI
>communication between the threads.  If this is done on a single server,
>then the MPI communication is done really quickly, via shared memory,
>which is vastly faster than using Ethernet or Infiniband to other
>servers.  However, this places a limit on the number of CPU-cores and
>the total memory.  When simulating three dimensional models, the RAM and
>CPU requirements can easily become extreme.  Meep was written to
>split the problem into multiple zones, and to work efficiently with MPI.

As you note, this is the advantage of setting up a message passing
architecture from the beginning.  It works regardless of the scale/method
of message passing.  There *are* differences in performance, though.
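
A common first-order way to reason about those performance differences is time = latency + size / bandwidth.  Here is a minimal Python sketch.  Only the 125 MB/s raw line rate for 1Gbps Ethernet is arithmetic; every other latency and bandwidth figure in ASSUMED_LINKS is a rough placeholder to be replaced with measurements from your own hardware.

```python
def transfer_time_s(message_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """First-order cost of sending one message: fixed latency plus size/bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


def half_performance_bytes(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Message size at which latency and wire time are equal.

    Below this size the link is latency-bound; above it, bandwidth-bound.
    """
    return latency_s * bandwidth_bytes_per_s


# (latency in seconds, bandwidth in bytes/second).  Placeholder values only.
ASSUMED_LINKS = {
    "shared memory": (1e-6, 10e9),
    "Infiniband": (2e-6, 3e9),
    "1Gbps Ethernet": (50e-6, 125e6),  # 1e9 bits/s / 8 = 125e6 bytes/s
}

if __name__ == "__main__":
    for name, (latency, bandwidth) in ASSUMED_LINKS.items():
        for size in (64, 64 * 1024, 16 * 1024 * 1024):
            t = transfer_time_s(size, latency, bandwidth)
            print(f"{name:>15}  {size:>9} bytes  {t * 1e6:10.1f} us")
```

Small messages are dominated by latency and large ones by bandwidth, so how much the choice of transport matters depends on how chatty the program is.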

>
>Ten or 15 years ago, the only way to get more compute power was to build
>a cluster and therefore to write the software to use MPI.  This was
>because CPU-devices had a single core (Intel Pentium 3 and 4) and
>because it was rare to find motherboards which handled multiple such
>chips.

Yes

>
>The next step would be to get a 4 socket motherboard from Tyan or
>SuperMicro for $800 or so and populate it with 8, 12 or (if money
>permits) 16 core CPUs and a bunch of ECC RAM.
>
>My forthcoming music synthesis program would run fine with 8 or 16GB of
>RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
>machines would do the trick nicely.

>