[Beowulf] Mark Hahn's Beowulf/Cluster/HPC mini-FAQ for newbies & some further thoughts

Robin Whittle rw at firstpr.com.au
Sat Nov 3 18:55:34 PDT 2012


On 2012-10-31 CJ O'Reilly asked some pertinent questions about HPC,
Cluster, Beowulf computing from the perspective of a newbie:

  http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html

The replies began in the "Digital Image Processing via
HPC/Cluster/Beowulf - Basics"  thread on the November pages of the
Beowulf Mailing List archives.  (The archives strip off the "Re: " from
the replies' subject line.)

  http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html

Mark Hahn wrote a response which I and CJ O'Reilly found most
informative.  For the benefit of other lost souls wandering into this
fascinating field, I nominate Mark's reply as a:

   Beowulf/Cluster/HPC FAQ for newbies
   Mark Hahn 2012-11-03
   http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html

Quite likely the other replies and those yet to come will be well worth
reading too, so please check the November thread index, and potentially
December if the discussions continue there.  Mark wrote a second
response exploring options for some hypothetical image processing problems.

Googling for an HPC or Cluster FAQ leads to FAQs for people who are users
of specific clusters.  I couldn't easily find a FAQ suitable for people
with general computer experience wondering what they can and can't do,
or at least what would be wise to do or not do, with clusters, High
Performance Computing etc.

The Wikipedia articles:

  http://en.wikipedia.org/wiki/High-performance_computing
  http://en.wikipedia.org/wiki/High-throughput_computing

are of general interest.

Here are some thoughts which might complement what Mark wrote.  I am a
complete newbie to this field and I hope more experienced people will
correct or expand on the following.

HPC has quite a long history, and my understanding is that the Beowulf
concept is to make clusters from easily available hardware and software:
operating systems, servers and interconnects.  The interconnects, I
think, are invariably Ethernet, since anything else is, or at least was
until recently, exotic and therefore expensive.

Traditionally - going back a decade or more - I think the assumptions
have been that a single server has one or maybe two CPU cores, and that
the only way to get a single system to do more work is to interconnect a
large number of these servers with a good file system and fast
inter-server communications, so that suitably written software (such as
something written to use MPI, so that the instances of the software on
multiple servers and/or CPU cores can work on the one problem
efficiently) can do the job well.  All these things - power supplies,
motherboards, CPUs, RAM, Ethernet (one or two 1Gbps ports on the
motherboard) and Ethernet switches - are inexpensive and easy to throw
together into a smallish cluster.

However, creating software to use it efficiently can be very tricky -
unless of course the need can be satisfied by MPI-based software which
someone else has already written.
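
To make the MPI idea a bit more concrete, here is a minimal sketch of my
own (not from Mark's reply) using the standard MPI C API from C++.  Each
process, or "rank", is a separate copy of the program - possibly running
on a different server - and the copies cooperate only by sending
messages.  It would be compiled with something like "mpicxx hello.cc"
and run with "mpirun -np 4 ./a.out":

  // Minimal MPI sketch: every rank other than 0 sends one number to
  // rank 0, which prints what it receives.
  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
      MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes in total?

      if (rank == 0) {
          std::printf("Rank 0 of %d waiting for the others...\n", size);
          for (int src = 1; src < size; src++) {
              double value;
              MPI_Recv(&value, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              std::printf("Rank 0 received %f from rank %d\n", value, src);
          }
      } else {
          double value = 100.0 * rank;        // stand-in for a partial result
          MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }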

For serious work, the cluster and its software needs to survive power
outages, failure of individual servers and memory errors, so ECC memory
is a good investment . . . which typically requires more expensive
motherboards and CPUs.

I understand that the most serious limitation of this approach is the
bandwidth and latency (how long it takes for a message to get to the
destination server) of 1Gbps Ethernet.  The most obvious alternatives
are using multiple 1Gbps Ethernet connections per server (but this is
complex and only marginally improves bandwidth, while doing little or
nothing for latency) or upgrading to Infiniband.  As far as I know,
Infiniband is exotic and expensive compared to the mass market
motherboards etc. from which a Beowulf cluster can be made.  In other
words, I think Infiniband is required to make a cluster work really
well, but it does not (yet) meet the original Beowulf goal of being
inexpensive and commonly available.
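
One way to make the latency question concrete is a simple MPI
"ping-pong" test between two ranks, timed with MPI_Wtime().  The sketch
below is my own illustration: run once with both ranks on one server and
once with the ranks on two servers, it shows the difference directly.
Typical figures are tens of microseconds one-way over ordinary gigabit
Ethernet versus a few microseconds or less over Infiniband, though the
exact numbers depend heavily on the hardware and the MPI stack.

  // Sketch of an MPI ping-pong latency test between rank 0 and rank 1.
  // Run with "mpirun -np 2 ./pingpong".
  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int iterations = 10000;
      char byte = 0;

      MPI_Barrier(MPI_COMM_WORLD);
      double start = MPI_Wtime();
      for (int i = 0; i < iterations; i++) {
          if (rank == 0) {
              MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double elapsed = MPI_Wtime() - start;

      if (rank == 0)
          std::printf("One-way latency: roughly %.2f microseconds\n",
                      elapsed / iterations / 2.0 * 1e6);

      MPI_Finalize();
      return 0;
  }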


I think this model of HPC cluster computing remains fundamentally true,
but there are two important developments in recent years which either
alter the way a cluster would be built or used or which may make the
best solution to a computing problem no longer a cluster.  These
developments are large numbers of CPU cores per server, and the use of
GPUs to do massive amounts of computing in a single inexpensive graphics
card - more crunching than was possible with a massive cluster a decade
earlier.

The ideal computing system would have a single CPU core which could run
at arbitrarily high frequencies, with low-latency, high-bandwidth
access to an arbitrarily large amount of RAM, with matching links to
hard disks or other non-volatile storage systems, with a good Ethernet
link to the rest of the world.

While CPU clock frequencies and computing effort per clock frequency
have been growing slowly for the last 10 years or so, there has been a
continuing increase in the number of CPU cores per CPU device (typically
a single chip, but sometimes multiple chips in a device which is plugged
into the motherboard) and in the number of CPU devices which can be
plugged into a motherboard.

Most mass market motherboards are for a single CPU device, but there are
a few two and four CPU motherboards for Intel and AMD CPUs.

It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU
cores per CPU device.  I think the 4 core i7 CPUs or their
ECC-compatible Xeon equivalents are marginally faster than those with 6
or 8 cores.

In all cases, as far as I know, combining multiple CPU cores and/or
multiple CPU devices results in a single computer system, with a single
operating system and a single body of memory, with multiple CPU cores
all running around in this shared memory.  I have no clear idea how each
CPU core knows what the other cores have written to the RAM they are
using, since each core is reading and writing via its own cache of the
memory contents.  This raises the question of inter-CPU-core
communications, within a single CPU chip, between chips in a multi-chip
CPU module, and between multiple CPU modules on the one motherboard.
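
(My understanding is that the hardware takes care of this with a
cache-coherence protocol, so the cores really do see a single consistent
memory; the software only has to synchronise its own accesses.)  As a
small illustration of the shared memory programming model, here is a
C++11 sketch of my own in which several threads communicate purely
through shared memory:

  // Each thread adds its partial sum into one shared atomic total.  The
  // cache-coherence hardware ensures every core sees a consistent view
  // of "total"; std::atomic makes the concurrent updates safe.
  #include <atomic>
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main()
  {
      const int num_threads = 8;               // e.g. one per CPU core
      std::atomic<long> total(0);              // lives in shared memory

      std::vector<std::thread> workers;
      for (int t = 0; t < num_threads; t++) {
          workers.emplace_back([t, &total]() {
              long partial = 0;
              for (long i = 0; i < 1000000; i++)
                  partial += i % (t + 1);      // stand-in for real work
              total += partial;                // atomic update, no lock needed
          });
      }
      for (auto &w : workers)
          w.join();

      std::printf("Combined result: %ld\n", total.load());
      return 0;
  }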

It is my impression that the AMD socket G34 Opterons:

  http://en.wikipedia.org/wiki/Opteron#Socket_G34

Magny-Cours Opteron 6100 and Interlagos 6200 devices solve these
problems better than current Intel chips.  For instance, it is possible
to buy 2 and 4 socket motherboards (though they are not mass-market, and
may require fancy power supplies) and then to plug 8, 12 or 16 core
CPU devices into them, with a bunch of DDR3 memory (ECC or not) and so
make yourself a single shared memory computer system with compute power
which would have only been achievable with a small cluster 5 years ago.

I understand that on a 4 CPU-module motherboard, the G34 Opterons have
a separate HyperTransport link between any one CPU-module and each of
the other three CPU-modules.

I understand that MPI works identically, from the programmer's
perspective, whether it is between CPU-cores on a shared memory computer
or between CPU-cores on separate servers.  However, the performance (low
latency and high bandwidth) of these communications within a single
shared memory system is vastly higher than between separate servers,
which would rely on Infiniband or Ethernet.

So even if you have, or are going to write, MPI-based software which can
run on a cluster, there may be an argument for not building a cluster as
such, but for building a single motherboard system with as many as 64
CPU cores.
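
For example, with a typical MPI implementation such as Open MPI, the
same binary can be launched entirely on one server:

  mpirun -np 16 ./my_mpi_program     # all 16 ranks on this one server

(my_mpi_program being a placeholder for whatever MPI executable has been
built).  The MPI library will then normally use shared memory rather
than the network for communication between the ranks; launched across
several servers instead, the identical binary communicates over Ethernet
or Infiniband.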

I think the major new big academic cluster projects focus on getting as
many CPU cores as possible into a single server, while minimising power
consumption per unit of compute power, and then hooking as many as
possible of these servers together with Infiniband.

Here is a somewhat rambling discussion of my own thoughts regarding
clusters and multi-core machines, for my own purposes.  My interests in
high performance computing involve music synthesis and physics simulation.

There is an existing, single-threaded (written in C, can't be made
multithreaded in any reasonable manner) music synthesis program called
Csound.  I want to use this now, but as a language for synthesis, I
think it is extremely clunky.  So I plan to write my own program - one
day . . .   When I do, it will be written in C++ and multithreaded, so
it will run nicely on multiple CPU-cores in a single machine.  Writing
and debugging a multithreaded program is more complex than doing so for
a single-threaded program, but I think it will be practical and a lot
easier than writing and debugging an MPI-based program running either
on multiple servers or on multiple CPU-cores on a single server.

I want to do some simulation of electromagnetic wave propagation using
an existing and widely used MPI-based (C++, open source) program called
Meep.  This can run as a single process, if there is enough RAM, or the
problem can be split up to run over multiple processes using MPI
communication between them.  If this is done on a single server, then
the MPI communication is done really quickly, via shared memory, which
is vastly faster than using Ethernet or Infiniband to other servers.
However, this places a limit on the number of CPU-cores and the total
memory.  When simulating three dimensional models, the RAM and CPU
requirements can easily become extreme.  Meep was written to split the
problem into multiple zones, and to work efficiently with MPI.
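
(For reference, parallel Meep is normally launched through MPI in the
usual way, along the lines of

  mpirun -np 16 meep-mpi my_simulation.ctl

where my_simulation.ctl is a placeholder for the control file; the exact
executable name depends on how Meep was built or packaged.)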

For Csound, my goal is to get a single piece of music synthesised in as
few hours as possible, so as to speed up the iterative nature of the
cycle: alter the composition, render it, listen to it and alter the
composition again.  Probably the best solution to this is to buy a
bleeding edge clock-speed (3.4GHz?) 4-core Intel i7 CPU and motherboard,
which are commodity consumer items for the gaming market.  Since Csound
is probably limited primarily by integer and floating point processing,
rather than access to large amounts of memory or by reading and writing
files, I could probably render three or four projects in parallel on a
4-core i7, with each rendering nearly as fast as if just one was being
rendered.  However, running four projects in parallel is of only
marginal benefit to me.

If I write my own music synthesis program, it will be in C++ and will be
designed for multithreading via multiple CPU-cores in a shared memory
(single motherboard) environment.  It would be vastly more difficult to
write such a program using MPI communications.  Music projects can
easily consume large amounts of CPU and memory power, but I would rather
concentrate on running a hot-shot single motherboard multi-CPU shared
memory system and writing multi-threaded C++ than try to write it for
MPI.  I think any MPI-based software would be more complex and generally
slower, even on a single server (MPI communications via shared memory
rather than Ethernet/Infiniband) than one which used multiple threads
each communicating via shared memory.
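
To show the sort of structure I have in mind, here is a hypothetical
C++11 sketch: each worker thread renders its own share of the voices
into a private buffer, and the main thread mixes the buffers at the end,
so the only synchronisation needed is the final join.  (The "voices"
here are just sine waves standing in for real synthesis.)

  #include <cmath>
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main()
  {
      const int    num_threads = 8;        // one per CPU core, say
      const int    num_voices  = 64;       // hypothetical synthesis voices
      const int    num_samples = 48000;    // one second at 48 kHz
      const double two_pi      = 6.283185307179586;

      // One private output buffer per thread - no locking while rendering.
      std::vector<std::vector<float>> buffers(
          num_threads, std::vector<float>(num_samples, 0.0f));

      std::vector<std::thread> workers;
      for (int t = 0; t < num_threads; t++) {
          workers.emplace_back([=, &buffers]() {
              // Each thread takes every num_threads-th voice.
              for (int v = t; v < num_voices; v += num_threads) {
                  double freq = 110.0 * (v + 1);       // stand-in oscillator
                  for (int i = 0; i < num_samples; i++)
                      buffers[t][i] += 0.01f *
                          std::sin(two_pi * freq * i / 48000.0);
              }
          });
      }
      for (auto &w : workers)
          w.join();

      // Mix the private buffers into the final output.
      std::vector<float> mix(num_samples, 0.0f);
      for (int t = 0; t < num_threads; t++)
          for (int i = 0; i < num_samples; i++)
              mix[i] += buffers[t][i];

      std::printf("Rendered %d samples using %d threads\n", num_samples,
                  num_threads);
      return 0;
  }

A real synthesiser would of course keep the threads alive in a pool and
hand them blocks of work, but the shared memory structure is the same.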

Ten or 15 years ago, the only way to get more compute power was to build
a cluster and therefore to write the software to use MPI.  This was
because CPU-devices had a single core (Intel Pentium 3 and 4) and
because it was rare to find motherboards which handled multiple such chips.

Now, with 4 CPU-module motherboards it is totally different.  A starting
point would be to get a 2-socket motherboard and plug 8-core Opterons
into it.  This can be done for less than $1000, not counting RAM and
power supply.  For instance the Asus KGPE-D16 can be bought for $450.  8
core 2.3GHz G34 Opterons can be found on eBay for $200 each.

I suspect that the bleeding edge mass-market 4 core Intel i7 CPUs are
probably faster per core than these Opterons.  I haven't researched
this, but they have a much faster clock rate, at 3.6GHz, and the CPU
design is the latest product of Intel's formidable R&D.  On the other
hand, the Opterons have HyperTransport and I think have a generally
strong
floating point performance.  (I haven't found benchmarks which
reasonably compare these.)

I guess that many more 8 and 12 core Opterons may come on the market as
people upgrade their existing systems to use the 16 core versions.

The next step would be to get a 4 socket motherboard from Tyan or
SuperMicro for $800 or so and populate it with 8, 12 or (if money
permits) 16 core CPUs and a bunch of ECC RAM.

My forthcoming music synthesis program would run fine with 8 or 16GB of
RAM.  So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron
machines would do the trick nicely.

This involves no cluster, HPC or fancy interconnect techniques.

Meep can very easily be limited by available RAM.  A cluster solves
this, in principle, since it is easy (in principle) to get an arbitrary
number of servers and run a single Meep project on all of them.
However, this would be slow compared to running the whole thing on
multiple CPU-cores in a single server.

I think the 4 socket G34 Opteron motherboards are probably the best way
to get a large amount of RAM into a single server.  Each CPU-module
socket has its own set of DDR3 RAM, and there are four of these.  The
Tyan S8812:

  http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180

has 8 memory slots per CPU-module socket.  Populated with 32 x 8GB ECC
memory at about $80 each, this would be 256GB of memory for $2,560.

As far as I know, if the problem requires more memory than this, then I
would need to use multiple servers in a cluster with MPI communications
via Ethernet or Infiniband.

However, if the problem can be handled by 32 to 64 CPU-cores with 256GB
of RAM, then doing it in a single server as described would be much
faster and generally less expensive than spreading the problem over
multiple servers.

The above is a rough explanation of why the increasing number of
CPU-cores per motherboard, together with the increased number of DIMM
slots per motherboard and the increased affordable memory per DIMM slot,
means that many projects which in the past could only be handled with a
cluster can now be handled faster and less expensively with a single
server.

This "single powerful multi-core server" approach is particularly
interesting to me regarding writing my own programs for music synthesis
program or physics simulation.  The simplest approach would be write the
software in single-threaded C++.  However, that won't make use of the
CPU power, so I need to use the inbuilt multithreading capabilities of
C++.  This requires a more complex program design, but I figure I can
cope with this.

In principle I could write a single-thread physics simulation program
and access massive memory via a 4 CPU-module Opteron motherboard, but
any such program would be performance limited by the CPU speed, so it
would make sense to split it up over multiple CPU-cores with multithreading.
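
As an illustration of what that splitting might look like (a sketch
only, not a real field solver), each thread can own a contiguous slab of
a shared grid, update it for one time step, and then wait for the others
before the next step begins:

  #include <algorithm>
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main()
  {
      const int num_threads = 8;
      const int grid_size   = 1 << 20;     // cells, shared by all threads
      const int time_steps  = 100;

      std::vector<double> current(grid_size, 0.0), next(grid_size, 0.0);
      current[grid_size / 2] = 1.0;        // an initial impulse

      for (int step = 0; step < time_steps; step++) {
          std::vector<std::thread> workers;
          for (int t = 0; t < num_threads; t++) {
              workers.emplace_back([=, &current, &next]() {
                  // Each thread owns a contiguous slab of the grid.
                  int begin = grid_size * (long long)t / num_threads;
                  int end   = grid_size * (long long)(t + 1) / num_threads;
                  for (int i = std::max(begin, 1);
                       i < std::min(end, grid_size - 1); i++)
                      next[i] = 0.5 * current[i] +
                                0.25 * (current[i - 1] + current[i + 1]);
              });
          }
          for (auto &w : workers)
              w.join();                    // all slabs done for this step
          current.swap(next);
      }

      std::printf("Finished %d steps on %d threads\n", time_steps,
                  num_threads);
      return 0;
  }

A real program would keep a pool of threads alive and use a barrier
between time steps rather than re-creating the threads, and a 3D grid
would be split into slabs along one axis, but the overall structure is
the same.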

For my own projects, I think this is the way forward.  As far as I can
see, I won't need more memory than 256GB and more CPU power than I can
get on a single motherboard.  If I did, then I would need to write or
use MPI-based software and build a cluster - or get some time on an
existing cluster.


There is another approach as well - writing software to run on GPUs.
Modern mass-market graphics cards have dozens or hundreds of processing
cores in what I think is a shared, very fast, memory environment.  These
cores are particularly good at floating point and can be programmed to
do all sorts of things, but only using a specialised subset of C / C++
(via toolkits such as CUDA or OpenCL).  I want to write in full C++.
However, for many people, these GPU systems are by far the cheapest form
of compute power available.  This raises questions of programming them,
running several such GPU boards in a single server, and running clusters
of such servers.

The recent thread on the Titan Supercomputer exemplifies this approach -
get as many CPU-cores and GPU-cores as possible onto a single
motherboard, in a reasonably power-efficient manner and then wire as
many as possible together with Infiniband to form a cluster.

Mass market motherboards and graphics cards linked with Ethernet are
arguably "Beowulf".  If and when Infiniband turns up on mass-market
motherboards
without a huge price premium, that will be "Beowulf" too.

I probably won't want or need a cluster, but I lurk here because I find
it interesting.

Multi-core conventional CPUs (64 bit Intel x86 compatible, running
GNU-Linux) and multiple such CPU-modules on a motherboard are chewing
away at the bottom end of the cluster field.  Likewise GPUs if the
problem can be handled by these extraordinary devices.

With clusters, since the inter-server communication system (Infiniband
or Ethernet), with its accompanying software requirements (MPI and
splitting the problem over multiple servers which do not share a single
memory system), is the most serious bottleneck, the best approach seems
to be to make each server as powerful as possible, and then use as few
of them as necessary.

 - Robin         http://www.firstpr.com.au



