[Beowulf] first cluster

Thu Jul 15 18:29:59 PDT 2010

>> Disadvantage is of course, when the system runs out of
>> memory the oom-killer will look for an eligible process
>> to be killed to free up some space.
>
> That assumes that you are permitting your compute nodes
> to overcommit their memory, if you disable overcommit I
> believe that you will instead just get malloc()'s failing
> when there is nothing for them to grab.

yes.  actually, configuring memory and swap is an interesting topic.
the feature Chris is referring to is, I think, the vm.overcommit_memory
sysctl (and the associated vm.overcommit_ratio.)  every distro I've seen
leaves these at the default setting: vm.overcommit_memory=0.  this
is basically the traditional setting that tells the kernel to feel free
to allocate way too much memory, and to resolve memory crunches via OOM
killing.  obviously, this isn't great, since it never tells apps to
conserve memory (by having malloc return NULL), and often kills processes
that you'd rather not have killed (sshd, other system daemons).  on clusters
where a node may be shared across users/jobs, OOM can result in serious
collateral damage...

we've used vm.overcommit_memory=2 fairly often.  in this mode, the kernel
limits its VM allocations to a combination of the size of ram and swap.
this is reflected in /proc/meminfo:CommitLimit which will be computed
as /proc/meminfo:SwapTotal + (vm.overcommit_ratio/100) * /proc/meminfo:MemTotal
(the ratio is a percentage).
/proc/meminfo:Committed_AS is the kernel's idea of total VM usage.
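
to make the arithmetic concrete, here is a minimal python sketch of the
CommitLimit calculation described above.  the function names and the sample
numbers are made up for illustration, and the formula is the simplified one
from this message (it ignores hugepages and vm.overcommit_kbytes, which a
real kernel also takes into account).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def commit_limit_kb(meminfo, overcommit_ratio):
    """Simplified CommitLimit: swap plus a percentage of RAM (all in kB)."""
    return meminfo["SwapTotal"] + meminfo["MemTotal"] * overcommit_ratio // 100


# Hypothetical sample values, not taken from any real node.
SAMPLE = """MemTotal:       16000000 kB
SwapTotal:       2000000 kB
CommitLimit:    10000000 kB
Committed_AS:    4000000 kB
"""

info = parse_meminfo(SAMPLE)
limit = commit_limit_kb(info, overcommit_ratio=50)
# How much more VM the kernel would still be willing to hand out.
headroom = info["CommitLimit"] - info["Committed_AS"]
print(limit, headroom)
```

on a live node you would feed it the contents of /proc/meminfo and the value
in /proc/sys/vm/overcommit_ratio instead of the sample text.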

IMO, it's essential to also run with RLIMIT_AS on all processes.  this is 
basically a VM limit per process (not totalled across processes, though 
of course threads by definition share a single VM.)  you might be thinking
that RLIMIT_RSS would be better - indeed it would, but the kernel doesn't 
implement it.  basically, limiting RSS is a bit tricky because you have to
deal with how to count shared pages, and the limiting logic is going to 
slow down some important hot paths.  (unlike AS (vsz), which only needs logic
during explicit brk/mmap/munmap ops.)
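
as a rough illustration of a per-process RLIMIT_AS, the sketch below starts a
child process with a capped address space using python's standard resource
and subprocess modules.  the helper name run_with_as_limit and the 1 GiB cap
are assumptions chosen for the example; a batch system would normally set
the limit from the job's memory request.

```python
import resource
import subprocess
import sys


def run_with_as_limit(cmd, limit_bytes):
    """Run cmd with RLIMIT_AS capped at limit_bytes; return its exit code."""

    def set_limit():
        # Runs in the child between fork and exec.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never try to raise the limit above an existing hard limit.
        if hard == resource.RLIM_INFINITY:
            new = limit_bytes
        else:
            new = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (new, new))

    result = subprocess.run(cmd, preexec_fn=set_limit,
                            stderr=subprocess.DEVNULL)
    return result.returncode


GIB = 1024 ** 3

# A trivial child fits comfortably under the cap.
ok = run_with_as_limit([sys.executable, "-c", "pass"], GIB)

# A child asking for 2 GiB should have its allocation refused
# (MemoryError in python, i.e. a failed malloc) rather than
# being OOM-killed later.
too_big = run_with_as_limit(
    [sys.executable, "-c", "x = bytearray(2 * 1024**3)"], GIB)

print(ok, too_big)
```

the point of the second call is that the failure is reported to the
offending process itself at allocation time, instead of some unrelated
process on the node being chosen by the OOM killer.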

of course, to be useful, this requires users to provide reasonable memory
limits at job-submission time.  (our user population is pretty diverse, and
isn't very good at doing wallclock limits, let alone "wizardly" issues like
VM footprint.)

batch systems often also provide their own resource management systems.
I'm not fond of putting much effort in this direction, since it's usually
based on a load-balancing model (which doesn't work if job memory use 
fluctuates), and upon on-node daemons which are assumed to be able to 
stay alive long enough to kill over-large job processes.  yes, one can harden
such system daemons by locking them into ram, but that's not an unalloyed win:
they'll probably be nontrivial in size, and such memory usage is unswappable,
even if some of the pages are never used...

anyway, back to the topic: it's eminently possible to run nodes without swap,
and reasonably safe to do so if your user community is not totally random,
and if you make smart use of vm.overcommit_memory=2 and RLIMIT_AS.  5 years
ago, running swapless was somewhat risky because the kernel was dramatically
better tested/tuned in a normal swap-able configuration.  my guess is that
the huge embedded ecosystem has made swapless more robust, especially if
you take the time to configure some basic sanity limits on user processes.

regards, mark hahn.