[Beowulf] Scheduler question -- non-uniform memory allocation to MPI
unl at harvill.net
Thu Jul 30 11:51:55 PDT 2015
Thank you for your reply. Yes, it's 'bad' code. It's WRF mostly. If
you have suggestions for that app I'm all ears. We don't control the
code-base. We're also not allowed to update it except between
projects, which is very infrequent.
It would be ideal if we could discretely control memory allocations to
individual processes within
a job but I don't expect it's possible. I wanted to reach out to this
list of experts in case we might be
The resistance comes from increased wait times: staggered serial jobs
leave nodes partially occupied, which prevents a node from being
allocated exclusively. Yes, the users would probably get better
aggregate turnaround time if they waited for node exclusivity...
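
As a rough illustration of the waste described in the original message
below, here is a minimal Python sketch of what a uniform per-task memory
request costs when rank 0 needs far more than the other ranks. The task
count and memory figures are hypothetical, chosen only for the
arithmetic; they are not measurements from any real WRF run, and the
function name is made up for this example.

```python
# Hypothetical figures for illustration only; not measured from a real job.
def uniform_request_waste(n_tasks, rank0_peak_gb, worker_peak_gb):
    """Return (requested_gb, used_gb, wasted_gb) when every task must
    request the same amount of memory, sized to cover the largest rank."""
    # With one memory-per-task value for the whole job, the request has
    # to cover the hungriest rank (rank 0 in the case described here).
    per_task_request = max(rank0_peak_gb, worker_peak_gb)
    requested = n_tasks * per_task_request
    # Actual peak usage: one heavy rank plus (n_tasks - 1) lighter ranks.
    used = rank0_peak_gb + (n_tasks - 1) * worker_peak_gb
    return requested, used, requested - used


requested, used, wasted = uniform_request_waste(
    n_tasks=64, rank0_peak_gb=16, worker_peak_gb=2
)
print(f"requested {requested} GB, used {used} GB, wasted {wasted} GB "
      f"({100 * wasted / requested:.0f}% of the request)")
```

Under these assumed numbers most of the requested memory is never
touched. Sizing the request to the lighter ranks instead is what leads
to rank 0 hitting the cgroup limit and being OOM-killed.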
On 7/30/2015 1:37 PM, Prentice Bisbal wrote:
> I don't want to be 'that guy', but it sounds like the root-cause of
> this problem is the programs themselves. A well-written parallel
> program should balance the workload and data pretty evenly across the
> nodes. Is this software written by your own researchers, open-source,
> or a commercial program? In my opinion, your efforts would be better
> spent fixing the program(s), if possible, than finding a scheduler
> with the feature you request, which I don't think exists.
> If you can't fix the software, I think you're out of luck.
> I was going to suggest requesting exclusive use of nodes (whole-node
> assignment) as the easiest solution. What is the basis for the resistance?
> On 07/30/2015 11:34 AM, Tom Harvill wrote:
>> We run SLURM with cgroups for memory containment of jobs. When users
>> request resources on our cluster, many times they will specify the
>> number of (MPI) tasks and
>> memory per task. The reality of much of the software that runs is
>> that most of the
>> memory is used by MPI rank 0 and much less on slave processes. This
>> is wasteful
>> and sometimes causes bad outcomes (OOMs and worse) during job runs.
>> AFAIK SLURM is not able to allow users to request a different amount
>> of memory
>> for different processes in their MPI pool. We used to run
>> Maui/Torque and I'm fairly
>> certain that feature is not present in that scheduler either.
>> Does anyone know if any scheduler allows the user to request
>> different amounts of
>> memory per process? We know we can move to whole-node assignment to
>> solve this problem but there is resistance to that...
>> Thank you!
>> Tom Harvill
>> Holland Computing Center
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit