[Beowulf] [External] numad?
pbisbal at pppl.gov
Tue Jan 18 19:56:43 UTC 2022
I turn it off. When I had it on, it would cause performance to tank.
Doing some basic analysis, it appeared numad would move all the work to
a single core, leaving all the others idle. Without knowing the inner
workings of numad, my guess is that it saw the processes accessing the
same region of memory, so moved all the processes to the core "closest"
to that memory.
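
For anyone who wants to check for this kind of pile-up on their own nodes, here is a rough Python sketch, not what was actually used here. It reads the "processor" field (field 39) of /proc/<pid>/stat, which is the CPU each process last ran on, and counts how many processes share each core. The function names are made up for illustration, and the process name "xhpl" in the usage note is only an assumption about what the HPL binary is called. It looks at the main thread of each process only; threads would need /proc/<pid>/task.

```python
import os
from collections import Counter


def last_cpu_from_stat(stat_text):
    """Return the CPU number a process last ran on, given its /proc/<pid>/stat text."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces or ')',
    # so split on the LAST ')' before splitting on whitespace.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return int(rest[36])


def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm matches name exactly."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids


def cores_in_use(pids):
    """Count how many of the given PIDs last ran on each CPU."""
    counts = Counter()
    for pid in pids:
        try:
            with open(f"/proc/{pid}/stat") as f:
                counts[last_cpu_from_stat(f.read())] += 1
        except OSError:
            continue  # process is gone or unreadable
    return counts
```

Calling cores_in_use(pids_by_name("xhpl")) every minute or so during a long run would show whether the ranks stay spread over the cores or collapse onto one. A healthy pinned MPI job should show roughly one rank per core.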
I didn't do any in-depth analysis, but turning off numad definitely
fixed that problem. The problem first appeared with a user code, and I
was able to reproduce it with HPL. It took 10-20 minutes for numad to
start migrating processes to the same core, so smaller "test" jobs
didn't trigger the behavior, and my first attempts at reproducing it
were unsuccessful. It wasn't until I ran "full" HPL tests on a node that
I was able to reproduce the problem.
I think I used turbostat or something like that to watch the load and/or
processor frequencies on the individual cores.
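
If turbostat isn't handy, per-core load can also be approximated from /proc/stat. The following is a minimal Python sketch under that assumption, not a replacement for turbostat: it reports only the busy fraction per core, not frequencies, and the function names are hypothetical.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core 'cpuN' line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like: cpu0 user nice system idle iowait irq softirq steal ...
        # Skip the aggregate "cpu" line and everything that is not a CPU line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Treat idle + iowait as "not busy".
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # Sum only the first eight fields; guest time is already included in user/nice.
        total = sum(values[:8])
        counters[fields[0]] = (total - idle, total)
    return counters


def busy_fraction(before, after):
    """Return {cpu_name: fraction of time busy} between two parse_proc_stat() samples."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        if cpu not in before:
            continue  # core came online between samples
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        result[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval seconds apart, and return per-core busy fractions."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return busy_fraction(before, after)
```

Running sample() periodically during a full-length HPL run would show one core near 1.0 and the rest near 0.0 if the processes have all been migrated to a single core.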
On 1/18/22 1:18 PM, Michael Di Domenico wrote:
> does anyone turn-on/off numad on their clusters? I'm running RHEL7.9
> on Intel CPU's and seeing a heavy performance impact on MPI jobs when
> running numad.
> diagnosis is pretty prelim right now, so i'm light on details. when
> running numad i'm seeing MPI jobs stall while numad pokes at the job.
> the stall is notable, like 10-12 seconds
> it's particularly interesting because if one rank stalls while numad
> runs, the others wait. once it frees they all continue, but then
> another rank gets hit, so i end up seeing this cyclic stall
> like i said i'm still looking into things, but i'm curious what
> everyone's take on numa is. my consensus is we probably don't even
> really need it since slurm/openmpi should be handling process
> placement anyhow