[Beowulf] mem consumption strategy for HPC apps?

Mon Apr 18 07:54:38 PDT 2005

On Mon, 18 Apr 2005, Toon Knapen wrote:

> It's true that there is no 'general case' both OTOH all developers of 
> HPC applications may learn a lot of each other. It's a pitty there is 
> little discussion on HPC application and/or library design, which is of 
> course OT for the beowulf list (it's just a general remark), except for 
> a few hot topics (such as beowulf).

It's not off-topic, its perfectly on topic.  Its just that the GENERAL
discussion is too broad to be useful to most people.  However, everybody
benefits from specific, focussed discussions on this very topic and they
occur fairly often if you look at the list archives.

I think what people are saying is that "what do you do if your
application is so big that it runs out of main memory on a HPC compute
cluster" is so broad that it cannot be answered (although I gave it a
very general shot in my first response:-).  Or rather, the answer is
trim it down to fit or prepare to work pretty hard on THIS task as there
is no simple answer.

However, if you ask instead something like "I'm thinking of splitting up
task X that is too big to fit in core on my system to run in parallel on
a cluster; is this a good idea or should I rely on VM and swap instead?"
then you might get some very focussed and useful responses that suggest
things like comparing the latency and bw hits associated with disk, the
extra overhead associated with the OS determining what chunks of data
need to be swapped in and out, and the IPC costs involved in accessing
the same data over a network.  You might also get folks that ask you if
your TASK can be split up or if only its DATA can be split up -- most
people who use clusters use them to split up the actual work not just
the data (although there can be benefit from either or both).  This
would then lead to an entirely constructive discussion, totally on
topic, of parallel scaling, how best to measure/estimate the IPC cost of
various partitionings and possible network hardware and balance them
against your CPU/memory speeds and capacities.  This in turn would
ultimately help you parallelize your application AND engineer a cluster
on which a parallelization would yield linear speedup for at least some
range of sizes.

So to get useful help, ask specific questions about your specific
application.  It may be that nobody on-list can help you, if nobody is
doing that particular thing or anything "like" that.  However, there are
lots of things people can/will do to help if you then get very specific
about your application's operational cycle, as well as generic resources
you can read online on parallel programming and cluster computing
(notably Ian Foster's online book on the former and my own online book
on the latter, both linked to e.g. http://www.phy.duke.edu/brahma and/or
my personal home page).

To give you a single example of how things work, let me outline the way
a person doing some sort of lattice problem might proceed.  For whatever
reason, they want to do a lattice of 1024^3 sites.  On each site there
is a struct containing eight double precision (8 byte) variables.  Their
primary lattice thus occupies 64 GB of space (plus sundry change for the
code itself, operational variables and the like), and they have a
cluster of 64 2 GB nodes on which to run it.

SO, they say "gee, if I put 1 GB blocks of the lattice on each node, it
fits".  A bit of arithmetic convinces one that this is blocks 256 to the
side (since 4^3 = 64).  Now, in order to advance the computation each
site needs to communicate with its nearest neighbors.  For all the
interior sites this is a local memory hit.  For all the boundary sites
(on all six faces of the partitioned cube) this involves two-way IPCs
with the nodes that contains the neighboring cubes.  On each face there
are 256^2 = 64K sites, eight bytes each is 512 KB, six faces = 3 MB of
IPC's per node per step.

The COST of this (in time) on (say) 100 BT is at least 100 usec per face
to start talking to a node plust 0.5/10 MB/sec = 0.05 seconds, bandwidth
dominated.  To do all six faces requires 0.3 seconds.  If you organize
your code carefully you can probably find a pattern that does all of the
bidirectional communications in parallel for all nodes in the SAME 0.3
seconds.

The other thing to compute is how long it takes to "take a step" on a
node, that is, do all the computations required on the GB of data local
to the node.  If we assume that the site update computation is
complicated and involves e.g. evaluating a transcendental function per
double precision node variable, chances are good that site updates per
node will take order of seconds -- the longer the better from the point
of view of parallel scaling.  OTOH, it might be that each site update
takes only a very small time so that updating an entire node takes only
order of one second.

Even in the latter case we do 64 1 GB units of work in 1.3 seconds
compared to 64 seconds on a big memory machine for near linear speedup.
The scaling of the computational burden with memory size and the ratio
of surface (for IPCs) to volume (for the computation) gives us near
linear speedup until the cost of IPCs is GREATER than the cost of
updating each node in parallel.  Paradoxically, making the computation
smaller here is what shifts it into a poorer scaling regime -- a lattice
that was only 10x10x10 would never scale out to 64 nodes because the
communications latency alone would likely exceed the computational time
per node.

It is interesting to contemplate what happens if each site needs to
communicate with ALL other sites in the computation.  In that case
updating any site has to loop over ALL other sites, or basically 64 GB
worth of data.  In this case a lattice decomposition scales very
differently -- each node does a "fast" loop over the local sites and
then starts a systematic exchange of data with other sites.  Node memory
requirements may double as well, as one may have to buffer the initial
state of each node's lattice to communicate to all the other nodes,
plus provide a buffer for incoming node data.  IPCs now involve 32 pairs
of 1 GB exchanges or an irrelevant latency hit, 1 GB/10 MB/sec = 100
seconds per pair, 3200 seconds (or almost an hour) per update of all
sites.  Unless I'm making egregious arithmetic errors again (very
possible as longtime list humans know:-).

This can STILL be very profitable for several reasons.  One is that
doing the full 64 GB computation still takes a very long time on our
mythical big memory machine.  One is now doing a double loop and
computation time now scales like the number of nodes squared.  The
mythical machine probably needs 128+ GB of core just to hold the
original and updated data, which is even more outrageous than 64 3-4 GB
nodes.  All of which favor decomposition.

Another is that even if the computation required to update a site is
SMALL so that the computation takes 64 x longer or even worse to
complete than it would on a mythical 128 GB machines, this is INFINITELY
FASTER than it would take to complete on a MYTHICAL 128 GB machine as
there is no such beast.  History is replete with examples of
computations that were performed slowly and tediously because they were
valuable enough to justify it even if they didn't "fit" into the
"comfortable" regime.  Newton's hand computations of orbits, Kepler's
long and tedious work with Brahe's observations for example.  You may
lose horribly on parallel scaling but be able to complete the
computation, and almost certainly can complete the computation faster
than it would complete swapping even if the code runs fully serially on
a single system and the network is used only as a big virtual memory.

Finally, we're estimating using 100 BT ethernet, which is an old pokey
network.  We have more than an order of magnitude improvement to play
with at the IPC level, and maybe have more than one PARADIGM to play
with at the IPC level, if we change networks.  We we could conceivably
drop the overall IPC hit to 320 seconds (five minutes per site update).

Anyway, this is just an example of the kind of thing the list OFTEN
discusses -- if you want to see what people know about your particular
problem (which may be nothing:-) then describe your particular problem
in considerable detail.  To get the most help, do the preliminary
estimation "like" that above on your own -- your computation needs to
handle N entities, updating each of these entities depends on a subset
of M entities distributed in a local/nonlocal pattern, to conduct an
"independent" step of the computation based on an instantaneous snapshot
of the N entities requires T time per entity, and so on.

THEN one can look for partitionings of the data entities with favorable
ratios of computation to IPCs, or can at the very least estimate how
long a parallelized version will take to run for various cluster
architectures and partitionings and scales.  Then you can make rational
cost/benefit decisions, which is the best anyone can ever do -- maybe
your problem "can" be done profitably at the scale you want to run it
at, maybe it can't, but you'll fully understand either answer.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu