[Beowulf] NUMA zone_reclaim_mode considered harmful?

Christopher Samuel samuel at unimelb.edu.au
Fri Sep 19 11:09:25 PDT 2014

Hi folks,

Over on the xCAT mailing list I've been involved in a thread relating
to diverse settings of zone_reclaim_mode across nodes in clusters.

It starts here with Stuart's good description of the problem and


I did some poking around on our systems and was able to confirm that
whilst our newer iDataplex (dx360 M4's) with Sandy Bridge CPUs all had
zone_reclaim_mode set to 0 our older iDataplex with Nehalems (dx360
M2's) were all 1, along with an older SGI UV10 (Westmere).

The clincher was the fact that on that same cluster we had some IBM
x3690 X5 with Maxx5's which boot an identical diskless image and they
had zone_reclaim_mode set to 0, not 1.
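
If you want to audit a mixed cluster the same way, here is a minimal
Python sketch that groups nodes by the value they report. It assumes
you have already gathered one "hostname: value" line per node (the
shape parallel shell tools commonly print); the function name and the
sample node names are hypothetical, not output from our systems.

```python
from collections import defaultdict


def group_by_mode(lines):
    """Group node names by their reported zone_reclaim_mode value.

    Each input line is assumed to look like "hostname: value".
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node, _, value = line.partition(":")
        groups[int(value)].append(node.strip())
    return dict(groups)


# Hypothetical sample input, for illustration only.
sample = """\
node001: 0
node002: 0
node101: 1
"""

for mode, nodes in sorted(group_by_mode(sample.splitlines()).items()):
    print(mode, ", ".join(nodes))
```

Any group with a non-zero value is a candidate for the benchmarking
suggested further down.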

Turns out that this is indeed autotuned by older kernels, with this
text in the kernel's Documentation/sysctl/vm.txt:

# zone_reclaim_mode is set during bootup to 1 if it is determined
# that pages from remote zones will cause a measurable performance
# reduction. The page allocator will then reclaim easily reusable
# pages (those page cache pages that are currently not used) before
# allocating off node pages.

However, in 3.16 a patch was committed that disabled this auto-tuning,
turning off zone reclamation by default.

It's probably worth checking your own x86-64 systems to see if this
is set for you, and benchmarking with it disabled if it is.
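
For checking a single node, a rough Python sketch follows. It reads
/proc/sys/vm/zone_reclaim_mode and decodes the value using the bit
meanings given in Documentation/sysctl/vm.txt (1, 2 and 4). The
function names are my own, and the path is a parameter only so the
code can be tried against a test file.

```python
from pathlib import Path

# Bit meanings as listed in the kernel's Documentation/sysctl/vm.txt.
FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return the behaviours enabled by a zone_reclaim_mode value."""
    return [name for bit, name in FLAGS.items() if value & bit]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current setting, or return None if the file is absent
    (for example on a kernel built without NUMA support)."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode is not available on this system")
    elif mode == 0:
        print("zone_reclaim_mode = 0 (disabled)")
    else:
        print(f"zone_reclaim_mode = {mode}: "
              + ", ".join(decode_zone_reclaim_mode(mode)))
```

To benchmark with it disabled you can set it at runtime as root with
"sysctl -w vm.zone_reclaim_mode=0"; that does not persist across
reboots.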

Here's that patch with its description:

commit 4f9b16a64753d0bb607454347036dc997fd03b82
Author: Mel Gorman <mgorman at suse.de>
Date:   Wed Jun 4 16:07:14 2014 -0700

    mm: disable zone_reclaim_mode by default

    When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node.  NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled but relatively few can identify the
    problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so lets have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes.  This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes.  The NUMA penalties were
    sufficiently high to justify reclaiming the memory.  On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this.  Favour the
    common case and disable it by default.  Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman <mgorman at suse.de>
    Acked-by: Johannes Weiner <hannes at cmpxchg.org>
    Reviewed-by: Zhang Yanfei <zhangyanfei at cn.fujitsu.com>
    Acked-by: Michal Hocko <mhocko at suse.cz>
    Reviewed-by: Christoph Lameter <cl at linux.com>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>

