[Beowulf] Scheduler question -- non-uniform memory allocation to MPI

Sat Aug 1 15:24:18 PDT 2015

On Thu, 30 Jul 2015 at 11:34 -0000, Tom Harvill wrote:

> We run SLURM with cgroups for memory containment of jobs.  When
> users request resources on our cluster many times they will specify
> the number of (MPI) tasks and memory per task.  The reality of much
> of the software that runs is that most of the memory is used by MPI
> rank 0 and much less on slave processes.  This is wasteful and
> sometimes causes bad outcomes (OOMs and worse) during job runs.

I'll note that this problem can also occur with Grid Engine and Open MPI.

We would get user reports of seemingly random job failures: the same
job would sometimes run and other times fail.

We normally allow shared node access, and the problem cases I've seen
were on a highly fragmented cluster with tasks spread 1-2 per node.
Having the job request exclusive nodes (8 cores each) was generally
enough to consolidate the qrsh processes from ~200 to ~50, which
left enough memory headroom for the master process.

The failures I've observed were caused by the MPI startup process, which
spawns a qrsh/ssh login from the master node to each of the slave
nodes (multiple MPI ranks on a slave node share the same qrsh connection).
The memory used by all of these qrsh processes on the master node can
add up to enough to cause out-of-memory conditions.
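
As a rough back-of-the-envelope sketch of why consolidating ranks onto
fewer nodes helps, the snippet below counts one qrsh connection per
remote node and multiplies by a per-process footprint.  The job size of
400 ranks and the 10 MB per qrsh figure are assumptions picked purely
for illustration (the function names are made up as well); measure the
real footprint on your own system before drawing conclusions.

```python
import math


def qrsh_connections(total_ranks, ranks_per_node):
    """Estimate qrsh logins held open on the master node.

    Assumes one qrsh connection per remote node, shared by all ranks
    placed on that node, and none for the master's own node.
    """
    nodes = math.ceil(total_ranks / ranks_per_node)
    return max(nodes - 1, 0)


def master_overhead_mb(total_ranks, ranks_per_node, mb_per_qrsh):
    """Estimate memory (MB) used on the master node by qrsh processes."""
    return qrsh_connections(total_ranks, ranks_per_node) * mb_per_qrsh


if __name__ == "__main__":
    TOTAL_RANKS = 400    # assumed job size, for illustration only
    MB_PER_QRSH = 10.0   # assumed per-qrsh footprint; measure locally

    # 2 ranks per node models a fragmented cluster; 8 models exclusive nodes.
    for label, per_node in (("fragmented", 2), ("exclusive", 8)):
        conns = qrsh_connections(TOTAL_RANKS, per_node)
        mem = master_overhead_mb(TOTAL_RANKS, per_node, MB_PER_QRSH)
        print(f"{label}: {conns} qrsh connections, about {mem:.0f} MB on master")
```

The point is only that the overhead scales with the number of nodes
the job touches, not with the number of ranks, so packing ranks more
densely cuts it proportionally.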

This "solution" (really a workaround) has been good enough for our
affected users so far.  Without other changes, this problem will
eventually return and will not have as simple a solution.

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone