[Beowulf] mem consumption strategy for HPC apps?
Toon Knapen
toon.knapen at fft.be
Sun Apr 17 12:46:46 PDT 2005
Mark Hahn wrote:
>>What is the ideal way to manage memory consumption in HPC applications?
>
>
> run the right jobs on the right machines.
>
But because memory is scarce, one needs a good memory consumption
strategy. And memory _is_ scarce; otherwise out-of-core solvers (like the
ones used in NASTRAN, for instance) would not be necessary.
>
>>For HPC applications, performance is everything. Next, we all know about
>>the famous performance-memory tradeoff, which says that performance can
>>be improved by consuming more memory and vice versa. Therefore HPC
>>applications want to consume all available memory.
>
>
> this is simply untrue. the HPC apps I deal with most fall into
> two broad categories:
>
> - montecarlo-type stuff, which tends to be incredibly small,
> even cache-resident, and which certainly does NOT want or need
> more memory.
>
> - physically-based simulations (cosmology, condensed matter,
> materials, etc), which has a very clear memory requirement based
> on the model being simulated.
>
> MC-class stuff is almost insignificant in memory use and scales linearly.
> physically-based simulations tend to be limited by work, not space - yes,
> you need enough memory, but world-class problems will be pushing the
> boundaries of cpu/interconnect speed, not so much memory size.
Direct solvers treating 1 million dofs (with a decent matrix bandwidth, of
course) need a _lot_ of memory. Thus: out-of-core solvers are necessary.
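To put a rough number on that (my own back-of-the-envelope, assuming a
banded direct factorisation, a half-bandwidth of about 10^4 and 8-byte
reals):

    factor storage ~= n * b * 8 bytes
                    = 1e6 * 1e4 * 8 bytes
                   ~= 80 GB

which is far beyond what our nodes have in core.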
> it's useful to state up-front what sizes we're talking about. I say that
> 1GB per cpu is the lower bound of what's reasonable today - just in terms
> of packaging (2x512M dimms per cpu). admittedly, Intel's approach might
> get by with less.
>
> but my biggest users are reasonably happy with 4GB/p today, maybe a little
> eager for 8GB/p tomorrow. that's reasonably congruent with where the market
> is (4x1G dimms per opteron, for instance.) MC people with very simple models
> are around 4MB/p (that's mega), and people with larger models (MC or not)
> are at around 400 MB/p.
>
I'm assuming at least 1G as well, but most machines have 4G per node and
up. Again, this is for direct solvers of big systems.
>
>>But the performance-memory tradeoff as mentioned above supposes infinite
>>memory and infinite memory bandwidth. Because memory is finite, consuming
>>more memory than is physically available will result in swapping by the OS
>
>
> swapping in HPC is a non-fatal error condition. note that I also didn't
> mention "old-fashioned" applications which use tons of disk space to
> cache partial results. I would argue that the "old-fashioned" jeer is
> at least partially justified by the assumption of disk-intensive apps
> that they can't scale NCPUS. yes, you can dress these apps up as
> "out-of-core", but I'm not so sure they really make sense today.
>
The question is just: any out-of-core solver uses blocking and treats the
data block per block. But how big should the blocks ideally be? Can I take
a block size that is almost equal to my physical memory, thus relying on
the rest of the app being swapped out (taking into account that a bigger
block size improves performance)?
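To make the question concrete, here is a minimal sketch of the kind of
loop I have in mind (my own toy code, not how NASTRAN does it; the file
name, the 4 GB figure and BLOCK_FRAC are just placeholders, and BLOCK_FRAC
is exactly the knob I am asking about):

    /* stream a big scratch file through a single in-core block */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PHYS_MEM   (4UL << 30)   /* assume 4 GB per node   */
    #define BLOCK_FRAC 0.5           /* 0.5? 0.9? nearly 1.0?  */

    static void process_block(double *a, size_t n)  /* placeholder work */
    {
        for (size_t i = 0; i < n; ++i)
            a[i] *= 2.0;
    }

    int main(void)
    {
        size_t  block_bytes = (size_t)(PHYS_MEM * BLOCK_FRAC);
        double *block = malloc(block_bytes);
        if (!block) { perror("malloc"); return 1; }

        int fd = open("scratch.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        off_t file_size = lseek(fd, 0, SEEK_END);

        for (off_t off = 0; off < file_size; off += block_bytes) {
            ssize_t got = pread(fd, block, block_bytes, off);
            if (got <= 0) break;
            process_block(block, got / sizeof(double));
            pwrite(fd, block, got, off);   /* write the block back */
        }
        close(fd);
        free(block);
        return 0;
    }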
>
>>Knowing this we could say that HPC applications generally want to eat
>>all available memory but not more. All available memory here means all
>
>
> I don't believe this is a useful generalization. applications have a
> "natural" size. applications which are blindly scaled up in data
> (without corresponding scaling of NCPUS) are highly suspect, IMO.
>
>
>>basic services because we suppose that HPC applications do not share
>>their processor with other applications (to have the whole cache to
>>themselves).
>
>
> don't freak out about caches! a cache flush is only a millisecond or so,
> which means that it's entirely reasonable to timeslice applications
> on a fairly coarse granularity (theoretically, even just a few seconds
> would be enough to amortize the flush.) admittedly, I'm ignoring the
> difference between a cache's worth of isolated fetches and a (streaming)
> flush, but back-of-envelope numbers indicate this is not a big problem.
>
> the real issue is that tight-coupled apps need gang scheduling.
>
True. Although sequential batch jobs are also queued.
>
>>Well, this is true for single-processor machines. On multi-proc
>>machines (SMP, NUMA) only a part of the physical memory can be consumed.
>
>
> depends on the machine. even on an altix with 9M caches, flushing
> would take ~2ms, so if you do it every 100 seconds, no one will notice.
But a typical timeslice is much shorter than 100 seconds. Additionally,
you're not taking into account the time you lose rebuilding your cache
afterwards.
>
> consider an opteron with 1M cache and 3 GB/s easily sustainable BW
> and 50 ns latency. .3 ms for a streaming flush and .8 ms line-by-line.
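(Spelling out that back-of-envelope, with 64-byte cache lines as my
assumption:

    streaming:     1 MB / 3 GB/s                        ~= 0.33 ms
    line-by-line:  (1 MB / 64 B) * 50 ns = 16384 * 50 ns ~= 0.8 ms

so the numbers check out.)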
>
>
>>loops (e.g. in the solver)? In the latter case, can we rely on the OS
>>swapping out the inactive parts of our application to make space for the
>>solver or would it be better that the application puts all
>>data-structures that are not used in the solver on disk to make sure?
>
>
> the OS has mediocre insight into which pages you really should have
> resident vs on disk. you probably *can* arrange your access patterns
> to be very simple (a single window moving through a large span of memory)
> which the VM will approximate. but you're probably better off doing
> it yourself.
OK, thanks. This was one of my main questions. So as you said before:
the OS swapping an HPC app is a non-fatal error.
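For what it's worth, this is how I picture the "let the VM approximate a
moving window" variant (a sketch under my own assumptions: Linux, an
mmap-able scratch file, and a 256 MB window picked arbitrarily):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scratch.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint: one pass, front to back */

        const size_t win = 256UL << 20;           /* 256 MB moving window          */
        double sum = 0.0;
        for (off_t off = 0; off < st.st_size; off += win) {
            size_t len = (size_t)(st.st_size - off);
            if (len > win) len = win;
            for (size_t i = 0; i < len; ++i)
                sum += p[off + i];                /* stand-in for the real work    */
            madvise(p + off, len, MADV_DONTNEED); /* we will not come back here    */
        }
        printf("checksum %f\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }

The explicit pread() loop I sketched earlier would then be the "do it
yourself" alternative.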
>
> I question whether this approach makes that much sense, though,
> since CPUs are *not* all that expensive, relative to large amounts
> of ram, and especially when considering the speed of disk.
It's about: how much effort should I spend to put all the data that is not
used in the solver on disk myself. If the OS does a very good job: why
bother? (Of course I keep using an out-of-core solver; I'm only talking
about the data in the rest of the app, such as the mesh data etc.)
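One cheap half-way measure I can think of (my own idea, not something from
Mark's mail, and it needs a sufficient memlock limit or root): pin only the
solver's working block with mlock(), so that when the node does start to
page it is the mesh and the other cold data that go to swap, not the block
we are actively streaming through.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t  block_bytes = 1UL << 30;          /* 1 GB working block (example) */
        double *block = malloc(block_bytes);
        if (!block) { perror("malloc"); return 1; }

        if (mlock(block, block_bytes) != 0)       /* may fail: RLIMIT_MEMLOCK     */
            perror("mlock (continuing unpinned)");

        /* ... the out-of-core solver works inside 'block' here ... */

        munlock(block, block_bytes);
        free(block);
        return 0;
    }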
>
>
>>OTOH if we want to limit the total memory consumption to 7.5GB, would it
>>be best to allocate a memory-pool of 7.5GB and if the pool is full abort
>>the application (after running for days)?
>
>
> depends on how effectively the VM can approximate the working set.
> I say kill the application when its %CPU drops too low due to thrashing.
> and go talk to the job's owner to figure out a better way/place to run it.
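A crude self-watchdog in that spirit, to be called from the solver's outer
loop (my sketch, not something we run in production; the 25% threshold and
the once-per-block sampling are guesses):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <time.h>

    static double cpu_seconds(void)               /* user + system time so far */
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
             + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    }

    void check_thrashing(void)                    /* call e.g. once per block */
    {
        static double last_cpu = 0.0, last_wall = 0.0;
        double cpu  = cpu_seconds();
        double wall = (double)time(NULL);

        if (last_wall > 0.0 && wall > last_wall) {
            double frac = (cpu - last_cpu) / (wall - last_wall);
            if (frac < 0.25) {
                fprintf(stderr, "spending only %.0f%% in cpu: thrashing, "
                                "giving up\n", 100.0 * frac);
                exit(2);
            }
        }
        last_cpu  = cpu;
        last_wall = wall;
    }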
>
> regards, mark hahn.
thanks for the very interesting response!
toon
--
Check out our training program on acoustics
and register on-line at http://www.fft.be/?id=35