[Beowulf] Definition of HPC

Mark Hahn hahn at mcmaster.ca
Tue Apr 23 13:05:05 PDT 2013


>> Because it stopped the random out of memory conditions that we were having.
>
> aha, so basically "rebooting windows resolves my performance problems" ;)

in other words, a workaround.  I think it's important to note when behavior
is a workaround, so that it don't get ossified into SOP.

> Mark, I don't understand your forcefulness here.

it's very simple: pagecache and VM balancing is a very important part of the
kernel, and has received a lot of quite productive attention over the years.
I question the assumption that "rebooting the pagecache" is a sensible way to 
deal with memory-tuning problems.  it seems very passive-aggressive to me:
as if there is an assumption that the kernel isn't or can't Do The Right Thing 
for HPC.

drop_caches is such a brutal, sledgehammer thing.  for instance, it can't be 
used if there are multiple jobs on the host.  it assumes there is absolutely
zero sharing between jobs (executables, contents of /etc/ld.so.cache, etc).
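
for reference, the knob itself is a single, global sysctl - there is no
per-job or per-NUMA-node variant.  a minimal sketch of what people actually
do (assuming root, and that you sync first so dirty pages can be dropped at all):

    # sketch only: drop clean pagecache plus dentries/inodes, system-wide
    import os
    os.sync()                                # drop_caches only frees *clean* pages
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")                       # 1=pagecache, 2=dentries/inodes, 3=both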

for sites where a single job is rolled onto all nodes and runs for a long 
time, then is entirely removed, sure, it may make sense.  rebooting entirely
might even work better.  I'm mainly concerned with clusters which run a 
wide mixture of jobs, probably with multiple jobs sharing a node at times.

> All modern compute nodes are essentially NUMA machines (I am assuming all are dual or more socket machines).

it depends.  we have some dual E5-2670 nodes that have two memory nodes - 
I strongly suspect that they do not need any pagecache-reboot, since 
they have just 2 normal zones to balance.  obviously, 4-chip nodes
(including AMD dual-G34 systems) have an increased chance of fragmentation.
similarly, if you shell out for a MANY-node system, and run a single job
at a time on it, you should certainly be more concerned with whether the 
kernel can balance all your tiny little memory zones.  standard statistics
apply: if the kernel balances a zone well .99 of the time, anyone with 
a few hundred zones will be very unhappy sometimes.
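
to put toy numbers on that (assuming zones misbehave independently):

    # if each zone behaves 99% of the time, the chance that *all* zones behave
    # shrinks quickly with the zone count
    p = 0.99
    for zones in (2, 16, 64, 256):
        print(zones, round(p ** zones, 3))   # 2 -> 0.98, 64 -> ~0.53, 256 -> ~0.08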

in short, all >1-socket servers are NUMA, but that doesn't mean you should drop_caches.

> If caches are a large fraction of memory then you have increased memory
> requests from the foreign node.

wait, slow down.  first, why are you assuming remote-node access?  do your 
jobs specifically touch a file from one node, populating the pagecache,
then move to another node to perform the actual IO?  we normally have a rank
wired to a particular core for its life.
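
(by "wired" I just mean plain CPU affinity; the MPI launcher or cpusets
usually do this, but the effect is the same as this sketch:)

    # sketch: pin the calling process (an MPI rank, say) to one core for life
    import os
    os.sched_setaffinity(0, {3})             # pid 0 = self; core 3 is arbitrary here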

yes, it's certainly possible for high IO to consume enough pagecache to also 
occupy space on remote nodes.  are you sure this is bad though?  pagecache is 
quite deliberately treated as a low-caste memory request - normally pagecache
scavenges its own current usage under memory pressure.  and the alternative 
is to be doing uncached IO (or pagecache misses).
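
if you suspect pagecache really is piling up on the "wrong" node, it's easy
to check before reaching for drop_caches - a rough sketch, reading the
per-node meminfo that sysfs exposes:

    # rough sketch: report per-NUMA-node pagecache (FilePages) from sysfs
    import glob, re
    for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
        m = re.search(r"FilePages:\s+(\d+) kB", open(path).read())
        if m:
            print(path.split("/")[-2], int(m.group(1)) // 1024, "MB pagecache")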

I often also meet people who think that having free memory is a good thing,
when in fact it means "I bought too much ram".  that's a little over the top,
of course, but the real message is that having all your ram occupied,
even or especially by pagecache, is good, since pagecache is so efficiently
scavenged.  (no IO, obviously - the Inactive fields in /proc/meminfo are 
lists dedicated to this sort of easy scavenging.)
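
it's worth actually looking at the numbers before deciding "used" memory is a
problem - a quick sketch over /proc/meminfo:

    # sketch: how much "used" memory is really just cheap-to-scavenge cache
    fields = {}
    for line in open("/proc/meminfo"):
        key, rest = line.split(":", 1)
        fields[key] = int(rest.split()[0])   # values are in kB
    for key in ("MemFree", "Cached", "Inactive(file)", "Dirty"):
        print(key, fields.get(key, 0) // 1024, "MB")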

> Surely for HPC workloads resetting the system so that you get deterministic run times is a good thing?

who says determinism is a good thing?  I assume, for instance, you turn off 
your CPU caches to obtain determinism, right?  I'm not claiming that variance
is good, but why do you assume that the normal functioning of the pagecache 
will cause it?

> It would be good to see if there really is a big spread in run time - as I
> think there should be.

why?  what's your logic?  afaict, you're assuming that the caches are either 
preventing memory allocations, causing swapping, or are somehow expensive to 
scavenge.  while that's conceivable, it would be a kernel bug.

here's the best justification I can think of for drop_caches: you run in a 
dedicated-host environment, but have a write-heavy workload.  and yet each 
job shares NOTHING with its successor, not even references to /bin/sh.
and each job spews out a vast bolus of writes at its end, and you don't want
to tune vm.dirty* to ensure this write happens smoothly.  instead, you want
to idle all your cpus and memory to ensure that the writes are synced before
letting the next job run.  (it will, of course, spend its initial seconds 
mainly missing the pagecache, so cpus/memory will be mostly idle during this 
time as well.)  but at least you can in good conscience charge the next user
for exclusive access, even though the whole-system throughput is lower due
to all the IO waiting and stalling.
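
(the vm.dirty* tuning I mean is just the writeback sysctls; an illustrative
sketch, not a recommendation - the numbers are made up:)

    # start background writeback early and cap dirty memory, so the end-of-job
    # burst of writes is already mostly on disk; needs root
    settings = {
        "/proc/sys/vm/dirty_background_bytes": "268435456",   # writeback at 256 MB dirty
        "/proc/sys/vm/dirty_bytes":            "1073741824",  # throttle writers at 1 GB dirty
    }
    for path, value in settings.items():
        with open(path, "w") as f:
            f.write(value + "\n")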

as I said before, it's conceivable that the kernel has bugs related to
scavenging - maybe it actually does eat a lot of cycles.  since scavenging 
clean pages is a normal, inherent behavior, this would be somewhat worrisome,
and should be reported/escalated as a Big Deal.  after all, all read IO
happens on this path, and all (non-O_DIRECT) writes *also* follow this path,
since a dirty page gets cleaned (synced), then scavenged like all the other
clean pages.


