[Beowulf] mem consumption strategy for HPC apps?

Fri Apr 15 12:16:16 PDT 2005

> What is the ideal way to manage memory consumption in HPC applications?

run the right jobs on the right machines.

> For HPC applications, performance is everything. Next we all know about 
> the famous performance-memory tradeoff which says that performance can 
> be improved by consuming more memory and vice versa. Therefore HPC 
> applications want to consume all available memory.

this is simply untrue.  the HPC apps I deal with most fall into 
two broad categories:

	- montecarlo-type stuff, which tends to be incredibly small,
	even cache-resident, and which certainly does NOT want or need 
	more memory.

	- physically-based simulations (cosmology, condensed matter, 
	materials, etc), which have a very clear memory requirement based
	on the model being simulated.

MC-class stuff is almost insignificant in memory use and scales linearly.
physically-based simulations tend to be limited by work, not space - yes,
you need enough memory, but world-class problems will be pushing the
boundaries of cpu/interconnect speed, not so much memory size.

it's useful to state up-front what sizes we're talking about.  I say that 
1GB per cpu is the lower bound of what's reasonable today - just in terms 
of packaging (2x512M dimms per cpu).  admittedly, Intel's approach might 
get by with less.

but my biggest users are reasonably happy with 4GB/p today, maybe a little
eager for 8GB/p tomorrow.  that's reasonably congruent with where the market
is (4x1G dimms per opteron, for instance.)  MC people with very simple models
are around 4MB/p (that's mega), and people with larger models (MC or not)
are at around 400 MB/p.

> But the performance-memory tradeoff as mentioned above supposes infinite 
> memory and infinite memory bandwidth. Because memory is finite, consuming 
> more memory than physically available will result in swapping by the OS 

swapping in HPC is a non-fatal error condition.  note that I also didn't
mention "old-fashioned" applications which use tons of disk space to 
cache partial results.  I would argue that the "old-fashioned" jeer is 
at least partially justified, since disk-intensive apps are built on the
assumption that they can't scale NCPUS.  yes, you can dress these apps up as
"out-of-core", but I'm not so sure they really make sense today.

> Knowing this we could say that HPC applications generally want to eat 
> all available memory but not more. All available memory here means all 

I don't believe this is a useful generalization.  applications have a 
"natural" size.  applications which are blindly scaled up in data
(without corresponding scaling of NCPUS) are highly suspect, IMO.

> basic services because we suppose that HPC applications do not share 
> their processor with other applications (to have the whole cache for 
> itself).

don't freak out about caches!  a cache flush is only a millisecond or so,
which means that it's entirely reasonable to timeslice applications 
on a fairly coarse granularity (theoretically, even just a few seconds
would be enough to amortize the flush.)  admittedly, I'm ignoring the 
difference between a cache's worth of isolated fetches and a (streaming)
flush, but back-of-envelope numbers indicate this is not a big problem.

the real issue is that tightly-coupled apps need gang scheduling.

> Well this is true for single-processor machines. On multi-proc 
> machines (smp,numa) only a part of the physical memory can be consumed.

depends on the machine.  even on an altix with 9M caches, flushing 
would take ~2ms, so if you do it every 100 seconds, no one will notice.

consider an opteron with 1M cache and 3 GB/s easily sustainable BW
and 50 ns latency.  that's ~.3 ms for a streaming flush (1M / 3 GB/s)
and ~.8 ms line-by-line (16K lines of 64 bytes at 50 ns each).
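
the flush numbers above can be checked with a few lines of python.  this is
only a back-of-envelope sketch: the function names are made up for
illustration, and the 64-byte cache line size is an assumption that is not
stated above.

```python
# back-of-envelope cache flush costs; all names here are illustrative.

def streaming_flush_s(cache_bytes, bandwidth_bytes_per_s):
    """Time to refill the whole cache at full streaming bandwidth."""
    return cache_bytes / bandwidth_bytes_per_s

def line_by_line_flush_s(cache_bytes, latency_s, line_bytes=64):
    """Time to refill the cache one isolated line fetch at a time.

    line_bytes=64 is an assumption, not a figure from the post.
    """
    return (cache_bytes / line_bytes) * latency_s

def amortized_overhead(flush_s, timeslice_s):
    """Fraction of a timeslice lost to a single flush."""
    return flush_s / timeslice_s

MB = 2**20

# opteron: 1M cache, 3 GB/s sustained bandwidth, 50 ns latency
opteron_stream = streaming_flush_s(1 * MB, 3e9)
opteron_lines = line_by_line_flush_s(1 * MB, 50e-9)

# altix: ~2 ms flush, one flush per 100 second timeslice
altix_overhead = amortized_overhead(2e-3, 100.0)

print(f"opteron streaming flush:    {opteron_stream * 1e3:.2f} ms")
print(f"opteron line-by-line flush: {opteron_lines * 1e3:.2f} ms")
print(f"altix overhead per slice:   {altix_overhead * 100:.4f} %")
```

even the pessimistic line-by-line figure stays under a millisecond, so a
timeslice of a few seconds loses well under a tenth of a percent to the flush.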

> loops (e.g. in the solver)? In the latter case, can we rely on the OS 
> swapping out the inactive parts of our application to make space for the 
> solver or would it be better that the application puts all 
> data-structures that are not used in the solver on disk to make sure? 

the OS has mediocre insight into which pages you really should have 
resident vs on disk.  you probably *can* arrange your access patterns
to be very simple (a single window moving through a large span of memory)
which the VM will approximate.  but you're probably better off doing 
it yourself.
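
as a rough illustration of "doing it yourself", the sketch below moves a
single fixed-size window through a large file with explicit reads, so that
only one window is resident at a time.  the file layout (raw native doubles),
the window size, and the name windowed_sum are all assumptions made for the
example; a real solver would do its own work on each window.

```python
import array
import os
import tempfile

def windowed_sum(path, window_items=1 << 16):
    """Sum a file of native doubles, holding only one window in memory.

    Assumes the file length is a multiple of the item size.
    """
    item_size = array.array("d").itemsize
    total = 0.0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(window_items * item_size)
            if not chunk:
                break  # end of file
            window = array.array("d")
            window.frombytes(chunk)   # explicit, application-managed buffer
            total += sum(window)      # stand-in for the real per-window work
    return total

if __name__ == "__main__":
    # build a small demo file; a real data set would be far larger than RAM
    fd, demo_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out:
        array.array("d", [float(i) for i in range(1000)]).tofile(out)
    print(windowed_sum(demo_path, window_items=128))
    os.remove(demo_path)
```

the point of the design is that the application, not the VM, decides which
span of data is resident, so the access pattern is as simple as it can be.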

I question whether this approach makes that much sense, though,
since CPUs are *not* all that expensive, relative to large amounts 
of ram, and especially when considering the speed of disk.

> OTOH if we want to limit the total memory consumption to 7.5GB, would it 
> be best to allocate a memory-pool of 7.5GB and if the pool is full abort 
> the application (after running for days)?

depends on how effectively the VM can approximate the working set.  
I say kill the application when its %CPU drops too low due to thrashing,
and go talk to the job's owner to figure out a better way/place to run it.
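
one way such a policy might be sketched: sample the job's cpu time and wall
time periodically, and flag the job once its %CPU has stayed low for several
samples in a row.  the threshold, the number of samples, and the function
names below are hypothetical policy knobs, not recommendations, and low %CPU
is only a proxy (a job can also sit idle waiting on I/O or communication).

```python
def cpu_fraction(cpu_s_start, cpu_s_end, wall_s_start, wall_s_end):
    """Fraction of wall-clock time the job spent on the CPU in an interval."""
    wall = wall_s_end - wall_s_start
    if wall <= 0:
        raise ValueError("interval must have positive wall time")
    return (cpu_s_end - cpu_s_start) / wall

def looks_like_thrashing(fractions, threshold=0.5, consecutive=3):
    """True if the last `consecutive` samples are all below `threshold`.

    Both defaults are made-up values for illustration only.
    """
    if len(fractions) < consecutive:
        return False  # not enough history to judge
    return all(f < threshold for f in fractions[-consecutive:])

# a job that ran well, then fell off a cliff once it started paging
history = [0.98, 0.97, 0.30, 0.12, 0.08]
if looks_like_thrashing(history):
    print("job looks like it is thrashing; kill it and talk to the owner")
```

requiring several consecutive low samples keeps a single slow interval from
triggering a kill.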

regards, mark hahn.