[Beowulf] mem consumption strategy for HPC apps?
Toon Knapen
toon.knapen at fft.be
Sun Apr 17 12:46:46 PDT 2005
Mark Hahn wrote:
>>What is the ideal way to manage memory consumption in HPC applications?
>
>
> run the right jobs on the right machines.
>
But because memory is scarce, one needs a good memory consumption
strategy. And memory _is_ scarce; otherwise out-of-core solvers (like the
ones used in NASTRAN, for instance) would not be necessary.
>
>>For HPC applications, performance is everything. Next, we all know about
>>the famous performance-memory tradeoff, which says that performance can
>>be improved by consuming more memory and vice versa. Therefore HPC
>>applications want to consume all available memory.
>
>
> this is simply untrue. the HPC apps I deal with most fall into
> two broad categories:
>
> - montecarlo-type stuff, which tends to be incredibly small,
> even cache-resident, and which certainly does NOT want or need
> more memory.
>
> - physically-based simulations (cosmology, condensed matter,
> materials, etc), which has a very clear memory requirement based
> on the model being simulated.
>
> MC-class stuff is almost insignificant in memory use and scales linearly.
> physically-based simulations tend to be limited by work, not space - yes,
> you need enough memory, but world-class problems will be pushing the
> boundaries of cpu/interconnect speed, not so much memory size.
Direct solvers treating 1 million dofs (with a decent matrix bandwidth, of
course) need a _lot_ of memory. Thus: out-of-core solvers are necessary.
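To put a rough number on that (my own back-of-the-envelope, assuming a
banded direct factorisation, a half-bandwidth of about 10^4 and 8-byte
reals):

    factor storage ~= n * b * 8 bytes
                    = 1e6 * 1e4 * 8 bytes
                   ~= 80 GB

which is far beyond what our nodes have in core.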
> it's useful to state up-front what sizes we're talking about. I say that
> 1GB per cpu is the lower bound of what's reasonable today - just in terms
> of packaging (2x512M dimms per cpu). admittedly, Intel's approach might
> get by with less.
>
> but my biggest users are reasonably happy with 4GB/p today, maybe a little
> eager for 8GB/p tomorrow. that's reasonably congruent with where the market
> is (4x1G dimms per opteron, for instance.) MC people with very simple models
> are around 4MB/p (that's mega), and people with larger models (MC or not)
> are at around 400 MB/p.
>
I'm assuming at least 1G as well, but most machines have 4G per node and
up. Again, this is for direct solvers of big systems.
>
>>But the performance-memory tradeoff as mentioned above supposes infinite
>>memory and infinite memory bandwidth. Because memory is finite, consuming
>>more memory than is physically available will result in swapping by the OS
>
>
> swapping in HPC is a non-fatal error condition. note that I also didn't
> mention "old-fashioned" applications which use tons of disk space to
> cache partial results. I would argue that the "old-fashioned" jeer is
> at least partially justified by the assumption of disk-intensive apps
> that they can't scale NCPUS. yes, you can dress these apps up as
> "out-of-core", but I'm not so sure they really make sense today.
>
The question is just: any out-of-core solver uses blocking and treats the
data block per block. But how big should the blocks ideally be? Can I take
a block size that is almost equal to my physical memory, thus relying on
the rest of the app being swapped out (taking into account that a bigger
block size improves performance)?
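To make the question concrete, here is a minimal sketch of the kind of
loop I have in mind (my own toy code, not how NASTRAN does it; the file
name, the 4 GB figure and BLOCK_FRAC are just placeholders, and BLOCK_FRAC
is exactly the knob I am asking about):

    /* stream a big scratch file through a single in-core block */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PHYS_MEM   (4UL << 30)   /* assume 4 GB per node   */
    #define BLOCK_FRAC 0.5           /* 0.5? 0.9? nearly 1.0?  */

    static void process_block(double *a, size_t n)  /* placeholder work */
    {
        for (size_t i = 0; i < n; ++i)
            a[i] *= 2.0;
    }

    int main(void)
    {
        size_t  block_bytes = (size_t)(PHYS_MEM * BLOCK_FRAC);
        double *block = malloc(block_bytes);
        if (!block) { perror("malloc"); return 1; }

        int fd = open("scratch.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        off_t file_size = lseek(fd, 0, SEEK_END);

        for (off_t off = 0; off < file_size; off += block_bytes) {
            ssize_t got = pread(fd, block, block_bytes, off);
            if (got <= 0) break;
            process_block(block, got / sizeof(double));
            pwrite(fd, block, got, off);   /* write the block back */
        }
        close(fd);
        free(block);
        return 0;
    }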
>
>>Knowing this we could say that HPC applications generally want to eat
>>all available memory but not more. All available memory here means all
>
>
> I don't believe this is a useful generalization. applications have a
> "natural" size. applications which are blindly scaled up in data
> (without corresponding scaling of NCPUS) are highly suspect, IMO.
>
>
>>basic services because we suppose that HPC applications do not share
>>their processor with other applications (to have the whole cache to
>>themselves).
>
>
> don't freak out about caches! a cache flush is only a millisecond or so,
> which means that it's entirely reasonable to timeslice applications
> on a fairly coarse granularity (theoretically, even just a few seconds
> would be enough to amortize the flush.) admittedly, I'm ignoring the
> difference between a cache's worth of isolated fetches and a (streaming)
> flush, but back-of-envelope numbers indicate this is not a big problem.
>
> the real issue is that tight-coupled apps need gang scheduling.
>
True. Although sequential batch jobs are also queued.
>
>>Well, this is true for single-processor machines. On multi-proc
>>machines (SMP, NUMA) only a part of the physical memory can be consumed.
>
>
> depends on the machine. even on an altix with 9M caches, flushing
> would take ~2ms, so if you do it every 100 seconds, no one will notice.
But a typical timeslice is much shorter than 100 seconds. Additionally,
you're not taking into account the time you lose rebuilding your cache
afterwards.
>
> consider an opteron with 1M cache and 3 GB/s easily sustainable BW
> and 50 ns latency. .3 ms for a streaming flush and .8 ms line-by-line.
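(Spelling out that back-of-envelope, with 64-byte cache lines as my
assumption:

    streaming:     1 MB / 3 GB/s                        ~= 0.33 ms
    line-by-line:  (1 MB / 64 B) * 50 ns = 16384 * 50 ns ~= 0.8 ms

so the numbers check out.)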
>
>
>>loops (e.g. in the solver)? In the latter case, can we rely on the OS
>>swapping out the inactive parts of our application to make space for the
>>solver or would it be better that the application puts all
>>data-structures that are not used in the solver on disk to make sure?
>
>
> the OS has mediocre insight into which pages you really should have
> resident vs on disk. you probably *can* arrange your access patterns
> to be very simple (a single window moving through a large span of memory)
> which the VM will approximate. but you're probably better off doing
> it yourself.
OK, thanks. This was one of my main questions. So as you said before:
the OS swapping an HPC app is a non-fatal error.
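For what it's worth, this is how I picture the "let the VM approximate a
moving window" variant (a sketch under my own assumptions: Linux, an
mmap-able scratch file, and a 256 MB window picked arbitrarily):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scratch.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint: one pass, front to back */

        const size_t win = 256UL << 20;           /* 256 MB moving window          */
        double sum = 0.0;
        for (off_t off = 0; off < st.st_size; off += win) {
            size_t len = (size_t)(st.st_size - off);
            if (len > win) len = win;
            for (size_t i = 0; i < len; ++i)
                sum += p[off + i];                /* stand-in for the real work    */
            madvise(p + off, len, MADV_DONTNEED); /* we will not come back here    */
        }
        printf("checksum %f\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }

The explicit pread() loop I sketched earlier would then be the "do it
yourself" alternative.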
>
> I question whether this approach makes that much sense, though,
> since CPUs are *not* all that expensive, relative to large amounts
> of ram, and especially when considering the speed of disk.
It's about: how much effort should I spend to put all the data that is not
used in the solver on disk myself. If the OS does a very good job: why
bother? (Of course I keep using an out-of-core solver; I'm only talking
about the data in the rest of the app, such as the mesh data etc.)
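One cheap half-way measure I can think of (my own idea, not something from
Mark's mail, and it needs a sufficient memlock limit or root): pin only the
solver's working block with mlock(), so that when the node does start to
page it is the mesh and the other cold data that go to swap, not the block
we are actively streaming through.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t  block_bytes = 1UL << 30;          /* 1 GB working block (example) */
        double *block = malloc(block_bytes);
        if (!block) { perror("malloc"); return 1; }

        if (mlock(block, block_bytes) != 0)       /* may fail: RLIMIT_MEMLOCK     */
            perror("mlock (continuing unpinned)");

        /* ... the out-of-core solver works inside 'block' here ... */

        munlock(block, block_bytes);
        free(block);
        return 0;
    }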
>
>
>>OTOH if we want to limit the total memory consumption to 7.5GB, would it
>>be best to allocate a memory-pool of 7.5GB and if the pool is full abort
>>the application (after running for days)?
>
>
> depends on how effectively the VM can approximate the working set.
> I say kill the application when its %CPU drops too low due to thrashing.
> and go talk to the job's owner to figure out a better way/place to run it.
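A crude self-watchdog in that spirit, to be called from the solver's outer
loop (my sketch, not something we run in production; the 25% threshold and
the once-per-block sampling are guesses):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <time.h>

    static double cpu_seconds(void)               /* user + system time so far */
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
             + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    }

    void check_thrashing(void)                    /* call e.g. once per block */
    {
        static double last_cpu = 0.0, last_wall = 0.0;
        double cpu  = cpu_seconds();
        double wall = (double)time(NULL);

        if (last_wall > 0.0 && wall > last_wall) {
            double frac = (cpu - last_cpu) / (wall - last_wall);
            if (frac < 0.25) {
                fprintf(stderr, "spending only %.0f%% in cpu: thrashing, "
                                "giving up\n", 100.0 * frac);
                exit(2);
            }
        }
        last_cpu  = cpu;
        last_wall = wall;
    }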
>
> regards, mark hahn.
thanks for the very interesting response!
toon
--
Check out our training program on acoustics
and register on-line at http://www.fft.be/?id=35