[Beowulf] mem consumption strategy for HPC apps?
hahn at physics.mcmaster.ca
Fri Apr 15 12:16:16 PDT 2005
> What is the ideal way to manage memory consumption in HPC applications?
run the right jobs on the right machines.
> For HPC applications, performance is everything. Next we all know about
> the famous performance-memory tradeoff which says that performance can
> be improved by consuming more memory and vice versa. Therefore HPC
> applications want to consume all available memory.
this is simply untrue. the HPC apps I deal with most fall into
two broad categories:
- montecarlo-type stuff, which tends to be incredibly small,
even cache-resident, and which certainly does NOT want or need
- physically-based simulations (cosmology, condensed matter,
materials, etc), which have a very clear memory requirement based
on the model being simulated.
MC-class stuff is almost insignificant in memory use and scales linearly.
physically-based simulations tend to be limited by work, not space - yes,
you need enough memory, but world-class problems will be pushing the
boundaries of cpu/interconnect speed, not so much memory size.
it's useful to state up-front what sizes we're talking about. I say that
1GB per cpu is the lower bound of what's reasonable today - just in terms
of packaging (2x512M dimms per cpu). admittedly, Intel's approach might
get by with less.
but my biggest users are reasonably happy with 4GB/p today, maybe a little
eager for 8GB/p tomorrow. that's reasonably congruent with where the market
is (4x1G dimms per opteron, for instance.) MC people with very simple models
are around 4MB/p (that's mega), and people with larger models (MC or not)
are at around 400 MB/p.
> But the performance-memory tradeoff as mentioned above supposes infinite
> memory and infinite memory bandwidth. Because memory is finite, consuming
> more memory than is physically available will result in swapping by the OS
swapping in HPC is a non-fatal error condition. note that I also didn't
mention "old-fashioned" applications which use tons of disk space to
cache partial results. I would argue that the "old-fashioned" jeer is
at least partially justified, since disk-intensive apps tend to assume
that they can't scale NCPUS. yes, you can dress these apps up as
"out-of-core", but I'm not so sure they really make sense today.
> Knowing this we could say that HPC applications generally want to eat
> all available memory but not more. All available memory here means all
I don't believe this is a useful generalization. applications have a
"natural" size. applications which are blindly scaled up in data
(without corresponding scaling of NCPUS) are highly suspect, IMO.
> basic services because we suppose that HPC applications do not share
> their processor with other applications (to have the whole cache for
don't freak out about caches! a cache flush is only a millisecond or so,
which means that it's entirely reasonable to timeslice applications
at a fairly coarse granularity (theoretically, even just a few seconds
would be enough to amortize the flush.) admittedly, I'm ignoring the
difference between a cache's worth of isolated fetches and a (streaming)
flush, but back-of-envelope numbers indicate this is not a big problem.
the real issue is that tightly-coupled apps need gang scheduling.
> Well this is true for single-processor machines. On multi-proc
> machines (smp,numa) only a part of the physical memory can be consumed.
depends on the machine. even on an altix with 9M caches, flushing
would take ~2ms, so if you do it every 100 seconds, no one will notice.
consider an opteron with 1M cache, 3 GB/s easily sustainable BW
and 50 ns latency: that's about .3 ms for a streaming flush and
.8 ms line-by-line (with 64-byte lines).
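
the back-of-envelope flush numbers above can be reproduced with a short sketch. the function names are made up for illustration, and the 64-byte line size is an assumption (it is what the line-by-line figure implies for an opteron). the inputs are the rough figures quoted above, not measurements.

```python
# back-of-envelope cache flush cost; all inputs are rough figures

def streaming_flush_s(cache_bytes, bandwidth_bytes_per_s):
    """Time to refill the whole cache at sustained streaming bandwidth."""
    return cache_bytes / bandwidth_bytes_per_s


def line_by_line_flush_s(cache_bytes, latency_s, line_bytes=64):
    """Time to refill the cache with isolated fetches, one line per latency.

    line_bytes=64 is an assumption, not a number from the post.
    """
    return (cache_bytes / line_bytes) * latency_s


def overhead_fraction(flush_s, timeslice_s):
    """Fraction of a timeslice lost to one cache flush."""
    return flush_s / timeslice_s


# opteron-like figures: 1M cache, 3 GB/s, 50 ns latency
cache = 2**20
stream = streaming_flush_s(cache, 3e9)
isolated = line_by_line_flush_s(cache, 50e-9)

print(f"streaming flush:    {stream * 1e3:.2f} ms")
print(f"line-by-line flush: {isolated * 1e3:.2f} ms")
# a ~2 ms flush once every 100 seconds
print(f"overhead at 100 s:  {overhead_fraction(2e-3, 100.0):.6%}")
```

the two flush times come out close to the .3 ms and .8 ms quoted above, and the overhead of a 2 ms flush per 100 second timeslice is a tiny fraction of a percent.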
> loops (e.g. in the solver)? In the latter case, can we rely on the OS
> swapping out the inactive parts of our application to make space for the
> solver or would it be better that the application puts all
> data-structures that are not used in the solver on disk to make sure?
the OS has mediocre insight into which pages you really should have
resident vs on disk. you probably *can* arrange your access patterns
to be very simple (a single window moving through a large span of memory)
which the VM will approximate. but you're probably better off doing
I question whether this approach makes that much sense, though,
since CPUs are *not* all that expensive, relative to large amounts
of ram, and especially when considering the speed of disk.
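
as an illustration of the "single window moving through a large span" access pattern mentioned above, here is a minimal sketch that walks a memory-mapped file one window at a time. the function name, the window size and the use of a byte sum as the "work" are all placeholders, not anything from the original discussion.

```python
import mmap
import os
import tempfile


def windowed_sum(path, window_bytes=1 << 20):
    """Sum every byte of a file, touching only one window at a time.

    window_bytes is an arbitrary illustrative size.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap refuses to map an empty file
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # strictly sequential: the window only ever moves forward
            for start in range(0, size, window_bytes):
                chunk = m[start:start + window_bytes]
                total += sum(chunk)
    return total


# small demo on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)
    demo_path = tmp.name
try:
    print(windowed_sum(demo_path, window_bytes=4096))
finally:
    os.remove(demo_path)
```

only the pages under the current window need to be resident at any moment, which is the kind of simple pattern a VM has a chance of approximating.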
> OTOH if we want to limit the total memory consumption to 7.5GB, would it
> be best to allocate a memory-pool of 7.5GB and if the pool is full abort
> the application (after running for days)?
depends on how effectively the VM can approximate the working set.
I say kill the application when its %CPU drops too low due to thrashing,
and go talk to the job's owner to figure out a better way/place to run it.
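
a minimal sketch of the "%CPU drops too low" test, assuming you already collect cumulative (wall seconds, cpu seconds) readings for the job from somewhere. the function name, the 20% threshold and the three-interval rule are hypothetical choices for illustration, and the sketch only makes the decision; it does not kill anything.

```python
def is_thrashing(samples, threshold=0.2, min_consecutive=3):
    """Decide whether a job looks like it is thrashing.

    samples: list of (wall_seconds, cpu_seconds) cumulative readings,
             oldest first.
    threshold: utilization (cpu time / wall time) below which an
               interval counts as "low"; 0.2 is an arbitrary example.
    min_consecutive: how many low intervals in a row are required, so
               that one short stall does not trigger a kill.
    """
    low_run = 0
    for (w0, c0), (w1, c1) in zip(samples, samples[1:]):
        dwall = w1 - w0
        if dwall <= 0:
            continue  # ignore duplicate or out-of-order readings
        utilization = (c1 - c0) / dwall
        if utilization < threshold:
            low_run += 1
            if low_run >= min_consecutive:
                return True
        else:
            low_run = 0
    return False


healthy = [(0, 0.0), (10, 9.5), (20, 19.0), (30, 28.5)]
stalled = [(0, 0.0), (10, 9.5), (20, 10.0), (30, 10.4), (40, 10.9)]
print(is_thrashing(healthy), is_thrashing(stalled))
```

requiring several low intervals in a row is a design choice to avoid flagging jobs that are merely doing a burst of IO.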
regards, mark hahn.