[Beowulf] Memory limit enforcement
David Kewley
kewley at gps.caltech.edu
Tue Oct 9 20:48:55 PDT 2007
On Monday 08 October 2007, Olli-Pekka Lehto wrote:
> Hello,
>
> I'm interested in hearing some best-practice solutions for fine-grained
> management of memory resources on clusters of SMPs. How do you enforce real
> memory usage inside (RHEL-based) cluster nodes running multiple serial
> jobs simultaneously? More specifically, how do you do this efficiently when
> some of the jobs map copious amounts of virtual memory but have only a
> fraction of it resident at any given time?
>
> As SMP systems keep getting fatter (and the potential for users
> interfering with each other's jobs keeps increasing), it would be great
> to have something like AIX's WLM (Workload Manager) on Linux to
> effectively manage intra-SMP resources.
>
> Olli-Pekka
Ah, a family of issues near to my heart. :)
I'll ask a broader question: How do you enforce real memory usage in modern
Linux *at all*?
We were interested in this because we were having user jobs regularly cause
nodes to go into an Out Of Memory (OOM) state, triggering the kernel's
oom_killer. The oom_killer would sometimes kill system processes, which
sometimes caused subsequent jobs to die. Even if subsequent jobs didn't
die, recovery required that we manually close the node, reboot it when the
running jobs finished, then reopen it. This gets to be pretty dreary after
a while.
Our problem is somewhat different from your interests, but some of the same
issues come into play. See below for the partially satisfying solution
that we put in place for our OOM woes. First a review of the problem
landscape as I understand it.
You can try to enforce memory limits with a daemon, but you risk missing
important events, including a badly behaved process suddenly using a whole
lot of memory all at once. If that happens, your daemon is nearly useless,
since swapping and/or the oom_killer will be running rather than your daemon.
Your node may lock up for a while, which is exactly what the daemon was
supposed to prevent.
I think you really want to do it in the kernel, so that badly behaved
requests for memory (allocation and/or writing) can be cut off before they
affect anyone else.
But the kernel doesn't really enforce anything useful. It doesn't enforce a
resident set size (RSS) limit, even though setrlimit() will let you request
such a limit. As I understand it, modern Linux doesn't even try to track
RSS for this purpose, because the semantics of RSS are unclear given modern
memory management methods.
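To make that concrete, here is a minimal sketch (the file name and sizes are
arbitrary, and it assumes the node has a few hundred MB of free RAM). On a
stock 2.6 kernel, setrlimit() happily accepts the RSS limit, and the kernel
then ignores it:

/* rss_limit_demo.c -- illustrate that RLIMIT_RSS is accepted but not
 * enforced on modern Linux.
 * Build with: gcc -o rss_limit_demo rss_limit_demo.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    size_t total = 256UL * 1024 * 1024;
    char *p;

    rl.rlim_cur = rl.rlim_max = 64 * 1024 * 1024;   /* 64 MB "limit" */
    if (setrlimit(RLIMIT_RSS, &rl) != 0) {
        perror("setrlimit(RLIMIT_RSS)");
        return 1;
    }

    /* Allocate and touch far more than 64 MB of anonymous memory.  This
     * succeeds: the kernel records the RSS limit but never checks it. */
    p = malloc(total);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    memset(p, 1, total);          /* force the pages to become resident */
    printf("touched %lu MB despite a 64 MB RLIMIT_RSS\n",
           (unsigned long)(total >> 20));
    return 0;
}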
RSS probably isn't even what you want -- you probably want to limit the
amount of physical memory used, keeping the sum of the limits around the
amount of total RAM, to avoid swapping. There is no way to communicate such
a limit to the kernel; I suspect the kernel doesn't even track that quantity
except globally.
The kernel *is* able to enforce the amount of virtual memory allocated per
process (set with setrlimit()), but as you noted, that is of limited value
when different applications can have very different overcommit percentages
(virtual memory allocated beyond the amount actually used).
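For example (sizes again arbitrary), once an address-space limit is in place,
an allocation that would push the process past it fails right away, whether
or not the memory would ever be touched:

/* as_limit_demo.c -- RLIMIT_AS *is* enforced: allocation, not use, is what
 * gets counted.  Build with: gcc -o as_limit_demo as_limit_demo.c
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    void *p;

    rl.rlim_cur = rl.rlim_max = 128 * 1024 * 1024;   /* 128 MB of VM */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit(RLIMIT_AS)");
        return 1;
    }

    /* Asking for 256 MB of address space now fails immediately, even though
     * the pages would never have been written. */
    p = malloc(256UL * 1024 * 1024);
    if (p == NULL)
        printf("malloc failed as expected: %s\n", strerror(errno));
    else
        printf("malloc unexpectedly succeeded\n");
    return 0;
}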
But take a step back from considering the limits you can place on a given
process. You probably want a policy that limits memory use at the job
level, not at the process level, regardless of whether you have one job or
multiple jobs running on a node. There is no kernel mechanism for that
either.
It seems your best bet might be to write a daemon, and hope that actual use
patterns don't cause swapping or OOM before the daemon can act.
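If you do go the daemon route, the core of it is little more than polling
/proc. Something along these lines is what I have in mind (the PID list on
the command line is a placeholder; a real daemon would get the job's PIDs
from the batch system, loop with a sleep, and act on a policy threshold):

/* job_rss.c -- sum VmRSS over a set of PIDs by reading /proc/<pid>/status.
 * Illustrative sketch only.  Build with: gcc -o job_rss job_rss.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Return VmRSS in kB for one process, or -1 if /proc/<pid>/status cannot
 * be read (e.g. the process has already exited). */
static long rss_kb(pid_t pid)
{
    char path[64], line[256];
    long kb = -1;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    long total = 0;
    int i;

    for (i = 1; i < argc; i++) {
        long kb = rss_kb((pid_t)atol(argv[i]));
        if (kb > 0)
            total += kb;
    }
    printf("job RSS: %ld kB\n", total);
    /* A real daemon would compare 'total' against the job's allocation and
     * signal the job or notify the scheduler when it crosses the line. */
    return 0;
}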
To end our OOM problems, we took a different route. The job launch
mechanism (via LSF) sets the per-process virtual-memory-allocation limit on
each user job process. We can prevent OOM this way, unless a job both uses
non-standard job launch methods and has runaway memory use (which is rare
in our experience).
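The idea is roughly the following wrapper (purely illustrative -- this is not
how LSF actually plumbs it through, and the JOB_VMEM_MB variable is made up):
set RLIMIT_AS in the launcher before exec'ing the user's program, and the
limit is inherited by the program and everything it forks:

/* vmem_wrap.c -- illustrative launcher: apply an address-space limit, then
 * exec the real job.  Build with: gcc -o vmem_wrap vmem_wrap.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    const char *mb_str = getenv("JOB_VMEM_MB");   /* made-up variable name */
    struct rlimit rl;

    if (argc < 2 || mb_str == NULL) {
        fprintf(stderr, "usage: JOB_VMEM_MB=<n> %s command [args...]\n",
                argv[0]);
        return 1;
    }

    rl.rlim_cur = rl.rlim_max = (rlim_t)atol(mb_str) * 1024 * 1024;
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit(RLIMIT_AS)");
        return 1;
    }

    /* The limit survives exec and is inherited across fork, so the whole
     * process tree of the job runs under it. */
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}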
Other weaknesses of our method include:
* It does not prevent heavy swapping (prevention would be nice to have, but
at least the offending user suffers most of the consequences).
* It can prevent a job from using all available RAM if the job has a larger
overcommit than our algorithm assumes.
* When the VM allocation limit is reached, the errors are often cryptic.
Nothing appears in syslog (unlike segfaults, which are logged at least on
x86_64) -- a kernel patch to enable logging would likely be pretty trivial,
but stock kernels don't do it. A malloc() that hits the limit fails,
returning NULL with errno set to ENOMEM, which many programs and libraries
don't handle properly (or indeed handle at all -- how many programmers omit
checking the return value or errno?), so the user doesn't get a useful error
message; see the small sketch after this list for what proper handling looks
like. A failed stack expansion will cause a segfault (as I recall), which is
also cryptic to the user. At least segfaults get logged...
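For completeness, handling that failure properly on the application side is
only a few lines; the xmalloc name below is just the conventional example:

/* Check malloc's return value and report errno, so the user sees
 * "Cannot allocate memory" instead of a mysterious crash later on. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "allocation of %lu bytes failed: %s\n",
                (unsigned long)n, strerror(errno));
        exit(EXIT_FAILURE);
    }
    return p;
}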
I'd love to hear other approaches to this family of problems.
David