[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Thu Jul 9 14:27:40 PDT 2015
On 09-Jul-2015 11:54, James Cuff wrote:
> http://blog.jcuff.net/2015/04/of-huge-pages-and-huge-performance-hits.html
Well, that seems to be it, but not quite with the same symptoms you
observed. khugepaged never showed up, and "perf top" never revealed
_spin_lock_irqsave. Instead this is what "perf top" shows in my tests:
(hugepage=always, when migration/# process observed)
89.97% [kernel] [k] compaction_alloc
1.21% [kernel] [k] compact_zone
1.18% [kernel] [k] get_pageblock_flags_group
0.75% [kernel] [k] __reset_isolation_suitable
0.57% [kernel] [k] clear_page_c_e
(hugepage=always, when events/# process observed)
85.97% [kernel] [k] compaction_alloc
0.84% [kernel] [k] compact_zone
0.65% [kernel] [k] get_pageblock_flags_group
0.64% perf [.] 0x000000000005cff7
(hugepage=never)
29.86% [kernel] [k] clear_page_c_e
21.88% [kernel] [k] copy_user_generic_string
12.46% [kernel] [k] __alloc_pages_nodemask
5.70% [kernel] [k] page_fault
This is good, because "perf top" shows that the underlying issue is
compaction_alloc and compact_zone in both cases, even though top itself
shows migration/# in one case and events/# in the other (when locked to
a cpu).
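For anyone wanting a second check besides "perf top", a minimal sketch is to
look at the compaction counters in /proc/vmstat before and after a slow read.
This assumes the kernel exposes compact_* counters there, which may not be the
case on every kernel build. The function name and the VMSTAT variable are made
up for illustration.

```shell
# Print the kernel's compaction counters (lines starting with "compact_").
# VMSTAT is a hypothetical override so the function can be pointed at a
# saved copy of /proc/vmstat instead of the live file.
compact_counters() {
    grep '^compact_' "${VMSTAT:-/proc/vmstat}"
}
```

Running compact_counters once before the big read and once during it, then
comparing the two outputs, should show whether compaction activity is climbing
while the IO is slow.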
Switching hugepage always->never seems to make things work right away.
Switching hugepage never->always seems to take a while to break: to get
it to start failing, many of the big files involved must be copied to
/dev/null again, even though they were presumably already in file cache.
Searched for "compaction_alloc" and "compact_zone" and found a
suggestion here
https://structureddata.github.io/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
to do:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
(transparent_hugepage is a link to redhat_transparent_hugepage).
Reenabled hugepage and reproduced the painfully slow IO, set defrag to
"never" and the IO was fast again, even though hugepage was still
enabled.
So on my machine the problem seems to be with hugepage defrag
specifically. Disabling just that is sufficient to resolve the issue;
it isn't necessary to take out all of hugepage. Will let
it run that way for a while and see if anything else shows up.
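A minimal sketch for checking which settings are active, assuming the control
files use the usual convention of marking the current value with [brackets].
The function names and the THP_ROOT variable are made up for illustration; only
the two sysfs directory names come from the system described above.

```shell
# Find the THP sysfs directory. RHEL/CentOS 6 uses the redhat_ prefix,
# other kernels use plain transparent_hugepage.
# THP_ROOT is a hypothetical override, useful for trying this on a copy.
thp_dir() {
    root="${THP_ROOT:-/sys/kernel/mm}"
    for d in "$root/redhat_transparent_hugepage" "$root/transparent_hugepage"; do
        if [ -d "$d" ]; then
            echo "$d"
            return 0
        fi
    done
    return 1
}

# Print the active value of a THP control file, i.e. the word in [brackets].
thp_active() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Show the active values of "enabled" and "defrag".
thp_show() {
    dir=$(thp_dir) || { echo "no THP sysfs directory found" >&2; return 1; }
    echo "enabled: $(thp_active "$dir/enabled")"
    echo "defrag:  $(thp_active "$dir/defrag")"
}
```

With these defined, thp_show reports the current state, and as root
echo never > "$(thp_dir)/defrag" applies the change described above. The
setting is not persistent, so it would need to be reapplied at boot, for
example from an rc script.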
For future reference:
CentOS release 6.6 (Final)
kernel 2.6.32-504.23.4.el6.x86_64
Dell Inc. PowerEdge T620/03GCPM, BIOS 2.2.2 01/16/2014
48 Intel Xeon CPU E5-2695 v2 @ 2.40GHz (in /proc/cpuinfo)
RAM 529231456 kB (in /proc/meminfo)
Thanks all!
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech