[Beowulf] mem consumption strategy for HPC apps?
Toon Knapen toon.knapen at fft.be
Sun Apr 17 12:46:46 PDT 2005
- Previous message: [Beowulf] Running on headnode only.
- Next message: [Beowulf] mem consumption strategy for HPC apps?
Mark Hahn wrote:
>>What is the ideal way to manage memory consumption in HPC applications?
>
> run the right jobs on the right machines.

But because memory is scarce, one needs a good memory consumption strategy. And memory is scarce; otherwise out-of-core solvers (like those used in NASTRAN, for instance) would not be necessary.

>>For HPC applications, performance is everything. Next we all know about
>>the famous performance-memory tradeoff which says that performance can
>>be improved by consuming more memory and vice versa. Therefore HPC
>>applications want to consume all available memory.
>
> this is simply untrue. the HPC apps I deal with most fall into
> two broad categories:
>
> - montecarlo-type stuff, which tends to be incredibly small,
>   even cache-resident, and which certainly does NOT want or need
>   more memory.
>
> - physically-based simulations (cosmology, condensed matter,
>   materials, etc), which has a very clear memory requirement based
>   on the model being simulated.
>
> MC-class stuff is almost insignificant in memory use and scales linearly.
> physically-based simulations tend to be limited by work, not space - yes,
> you need enough memory, but world-class problems will be pushing the
> boundaries of cpu/interconnect speed, not so much memory size.

Direct solvers treating 1 million dofs (and a decent bandwidth of course) need a _lot_ of memory. Thus: out-of-core solvers are necessary.

> it's useful to state up-front what sizes we're talking about. I say that
> 1GB per cpu is the lower bound of what's reasonable today - just in terms
> of packaging (2x512M dimms per cpu). admittedly, Intel's approach might
> get by with less.
>
> but my biggest users are reasonably happy with 4GB/p today, maybe a little
> eager for 8GB/p tomorrow. that's reasonably congruent with where the market
> is (4x1G dimms per opteron, for instance.) MC people with very simple models
> are around 4MB/p (that's mega), and people with larger models (MC or not)
> are at around 400 MB/p.

I'm assuming at least 1G also. But most have 4G per node and up. But again, this is for direct solvers of big systems.

>>But the performance-memory tradeoff as mentioned above supposes infinite
>>memory and infinite memory bandwidth. Because memory is finite, consuming
>>more memory than physically available will result in swapping by the OS
>
> swapping in HPC is a non-fatal error condition. note that I also didn't
> mention "old-fashioned" applications which use tons of disk space to
> cache partial results. I would argue that the "old-fashioned" jeer is
> at least partially justified by the assumption of disk-intensive apps
> that they can't scale NCPUS. yes, you can dress these apps up as
> "out-of-core", but I'm not so sure they really make sense today.

The question is just this: any out-of-core solver uses blocking and processes the data block by block. But how big should the blocks ideally be? Can I take a block size that is almost equal to my physical memory and thus rely on the rest of the app being swapped out (taking into account that a bigger block size improves performance)?

>>Knowing this we could say that HPC applications generally want to eat
>>all available memory but not more. All available memory here means all
>
> I don't believe this is a useful generalization. applications have a
> "natural" size. applications which are blindly scaled up in data
> (without corresponding scaling of NCPUS) are highly suspect, IMO.
>
>>basic services because we suppose that HPC applications do not share
>>their processor with other applications (to have the whole cache for
>>itself).
>
> don't freak out about caches! a cache flush is only a millisecond or so,
> which means that it's entirely reasonable to timeslice applications
> on a fairly coarse granularity (theoretically, even just a few seconds
> would be enough to amortize the flush.) admittedly, I'm ignoring the
> difference between a cache's worth of isolated fetches and a (streaming)
> flush, but back-of-envelope numbers indicate this is not a big problem.
>
> the real issue is that tight-coupled apps need gang scheduling.

True. Although sequential batch jobs are also queued.

>>Well this is true for single-processor machines. On multi-proc
>>machines (smp,numa) only a part of the physical memory can be consumed.
>
> depends on the machine. even on an altix with 9M caches, flushing
> would take ~2ms, so if you do it every 100 seconds, no one will notice.

But a typical timeslice is much shorter than 100 seconds. Additionally, you're not taking into account the time you lose rebuilding your cache.

> consider an opteron with 1M cache and 3 GB/s easily sustainable BW
> and 50 ns latency. .3 ms for a streaming flush and .8 ms line-by-line.
>
>>loops (e.g. in the solver)? In the latter case, can we rely on the OS
>>swapping out the inactive parts of our application to make space for the
>>solver or would it be better that the application puts all
>>data-structures that are not used in the solver on disk to make sure?
>
> the OS has mediocre insight into which pages you really should have
> resident vs on disk. you probably *can* arrange your access patterns
> to be very simple (a single window moving through a large span of memory)
> which the VM will approximate. but you're probably better off doing
> it yourself.

OK, thanks. This was one of my main questions. So as you said before: the OS swapping an HPC app is a non-fatal error.

> I question whether this approach makes that much sense, though,
> since CPUs are *not* all that expensive, relative to large amounts
> of ram, and especially when considering the speed of disk.

The question is how much effort I must spend to put all the data that is not used in the solver on disk. If the OS does a very good job, why bother? (Of course I keep using an out-of-core solver; I'm only talking about the data in the rest of the app, such as the mesh data.)

>>OTOH if we want to limit the total memory consumption to 7.5GB, would it
>>be best to allocate a memory-pool of 7.5GB and if the pool is full abort
>>the application (after running for days)?
>
> depends on how effectively the VM can approximate the working set.
> I say kill the application when its %CPU drops too low due to thrashing.
> and go talk to the job's owner to figure out a better way/place to run it.
>
> regards, mark hahn.

Thanks for the very interesting response!

toon

--
Check out our training program on acoustics and register on-line at
http://www.fft.be/?id=35
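
Editor's note: the back-of-envelope cache-flush numbers quoted in this message (1M cache, 3 GB/s, 50 ns latency) can be reproduced with a few lines of arithmetic. The sketch below is only an illustration. It assumes a 64-byte cache line and reads "1M" as 2**20 bytes, neither of which is stated in the thread, and the function names are invented for this example.

```python
# Back-of-envelope cache flush cost, following the figures quoted in the thread.
# Assumptions (not from the thread): 64-byte cache lines, "1M" = 2**20 bytes.

def streaming_flush_ms(cache_bytes, bandwidth_bytes_per_s):
    """Time to stream the whole cache at the sustained bandwidth, in ms."""
    return cache_bytes / bandwidth_bytes_per_s * 1e3


def line_by_line_flush_ms(cache_bytes, latency_s, line_bytes=64):
    """Time if every cache line costs one full memory latency, in ms."""
    n_lines = cache_bytes // line_bytes
    return n_lines * latency_s * 1e3


def flush_overhead_fraction(flush_ms, timeslice_s):
    """Fraction of one timeslice spent on a single flush."""
    return flush_ms / (timeslice_s * 1e3)


if __name__ == "__main__":
    cache = 2**20        # 1M cache (Opteron example from the thread)
    bandwidth = 3e9      # 3 GB/s sustainable bandwidth
    latency = 50e-9      # 50 ns latency

    stream = streaming_flush_ms(cache, bandwidth)
    per_line = line_by_line_flush_ms(cache, latency)
    print(f"streaming flush:    {stream:.2f} ms")
    print(f"line-by-line flush: {per_line:.2f} ms")

    # Hypothetical timeslice lengths, chosen only to show the trend.
    for timeslice in (0.01, 1.0, 100.0):
        frac = flush_overhead_fraction(per_line, timeslice)
        print(f"timeslice {timeslice:>6} s -> overhead {frac:.4%}")
```

With these inputs the estimate comes out at roughly 0.35 ms for a streaming flush and 0.82 ms line by line, consistent with the ".3 ms" and ".8 ms" in the message. The overhead is negligible for timeslices of a second or more, but at a 10 ms timeslice a line-by-line refill would cost about 8% of the slice, which is the case the reply about shorter timeslices is concerned with.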
