[Beowulf] NUMA zone_reclaim_mode considered harmful?
samuel at unimelb.edu.au
Fri Sep 19 11:09:25 PDT 2014
Over on the xCAT mailing list I've been involved in a thread relating
to diverse settings of zone_reclaim_mode across nodes in clusters.
It starts here with Stuart's good description of the problem and
I did some poking around on our systems and was able to confirm that
whilst our newer iDataplex (dx360 M4's) with Sandy Bridge CPUs all had
zone_reclaim_mode set to 0, our older iDataplex with Nehalems (dx360
M2's) were all 1, along with an older SGI UV10 (Westmere).
The clincher was the fact that on that same cluster we had some IBM
x3690 X5 with MAX5's which boot an identical diskless image and they
had zone_reclaim_mode set to 0, not 1.
Turns out that this is indeed autotuned by older kernels, with this
text in the kernel Documentation/sysctl/vm.txt:
# zone_reclaim_mode is set during bootup to 1 if it is determined
# that pages from remote zones will cause a measurable performance
# reduction. The page allocator will then reclaim easily reusable
# pages (those page cache pages that are currently not used) before
# allocating off node pages.
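The same vm.txt describes zone_reclaim_mode as a set of flags ORed
together (1 = zone reclaim on, 2 = zone reclaim writes dirty pages out,
4 = zone reclaim swaps pages). A minimal sketch for decoding a value
follows; the function and dictionary names are my own for illustration,
not anything from the kernel.

```python
# Bit meanings as listed in Documentation/sysctl/vm.txt.
# The names below are illustrative only, not kernel identifiers.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return descriptions of the flags set in a zone_reclaim_mode value."""
    return [desc for bit, desc in ZONE_RECLAIM_BITS.items() if value & bit]


if __name__ == "__main__":
    for v in (0, 1, 3, 7):
        print(v, decode_zone_reclaim_mode(v) or ["disabled"])
```

So the value 1 that older kernels auto-select only turns on reclaim of
clean page cache pages; writing out dirty pages and swapping have to be
enabled explicitly.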
However, in 3.16 a patch was committed that disabled this auto-tuning,
turning off zone reclamation by default.
It's probably worth checking your own x86-64 systems to see if this
is set for you and benchmarking with it disabled if it is.
Here's that patch with the description:
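For anyone wanting to check a node quickly, here is a rough sketch that
reads the current value from /proc/sys/vm/zone_reclaim_mode (the procfs
view of the vm.zone_reclaim_mode sysctl). The function name is just
something I made up, and I'm assuming the file may be missing on
kernels built without NUMA support, so that case is handled rather
than treated as an error.

```python
def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the current setting as an int, or None if the file is absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Assumption: no such file means no NUMA support (or not Linux).
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode not present on this system")
    elif mode == 0:
        print("zone_reclaim_mode = 0 (zone reclaim disabled)")
    else:
        print("zone_reclaim_mode = %d (zone reclaim enabled)" % mode)
```

If it reports a non-zero value, running "sysctl -w vm.zone_reclaim_mode=0"
as root turns it off until the next reboot, which should be enough for
a benchmark run.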
Author: Mel Gorman <mgorman at suse.de>
Date: Wed Jun 4 16:07:14 2014 -0700
mm: disable zone_reclaim_mode by default
When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.
Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.
This patch (of 2):
zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes. This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes. The NUMA penalties were
sufficiently high to justify reclaiming the memory. On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this. Favour the
common case and disable it by default. Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman <mgorman at suse.de>
Acked-by: Johannes Weiner <hannes at cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei at cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko at suse.cz>
Reviewed-by: Christoph Lameter <cl at linux.com>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545