[Beowulf] Large Dell, odd IO delays

Kilian Cavalotti kilian.cavalotti.work at gmail.com
Wed Feb 14 15:44:01 PST 2018


On Wed, Feb 14, 2018 at 2:26 PM, David Mathog <mathog at caltech.edu> wrote:
> Checked the hugepage settings and found a difference there.  The two systems
> that don't do this have  /sys/kernel/mm/redhat_transparent_hugepage/defrag
>
> always madvise [never]
>
> whereas the system with the issue has:
>
> [always] madvise never

THP (transparent hugepage) defragmentation has definitely bitten us in
the past when under memory pressure, and we now default to [madvise]
pretty much everywhere (we're too timid to disable it entirely).

A good way to check whether that's really the issue is to run, as root,
"echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag" while
the problem is happening, and to watch the affected processes with htop
(for instance) at the same time.
The effect is usually immediate: if THP defrag is the culprit, CPU
usage for the stalling process should drop right away and things
should go back to normal.
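
For what it's worth, the check above could be wrapped in a couple of
small shell helpers along these lines. This is only a sketch: the
function names and the THP_DEFRAG variable are made up for
illustration, and the path is the RHEL-style one quoted in this thread
(mainline kernels use /sys/kernel/mm/transparent_hugepage/defrag
instead).

```shell
#!/bin/sh
# Sketch only: helper names and THP_DEFRAG are illustrative assumptions.
# Default path is the one quoted in this thread; override it on kernels
# that expose /sys/kernel/mm/transparent_hugepage/defrag instead.
THP_DEFRAG="${THP_DEFRAG:-/sys/kernel/mm/redhat_transparent_hugepage/defrag}"

# Read a line such as "[always] madvise never" on stdin and print the
# active mode, i.e. the word inside the square brackets.
thp_active_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Show the active mode, write the requested one, then show it again.
# Needs root; the change does not persist across a reboot.
thp_set_defrag() {
    echo "before: $(thp_active_mode < "$THP_DEFRAG")"
    echo "$1" > "$THP_DEFRAG"
    echo "after:  $(thp_active_mode < "$THP_DEFRAG")"
}
```

With these sourced into a root shell, "thp_set_defrag never" while the
stall is in progress (with htop open in another terminal) does the
test, and "thp_set_defrag always" puts the previous value back if it
made no difference.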

Cheers,
--
Kilian