On 02/06/2015 05:35 PM, Prentice Bisbal wrote:
> Do any of you disable swap on your compute nodes?

Yes and no.  We try to keep it minimal if possible.  Usually on compute 
nodes, if we have physical disk, we set up a RAID1 and put swap on that ...

> I brought this up in a presentation I gave last night on HPC system 
> administration, but realized I never actually did this, or know of 
> anyone who has. I would tweak the vm.overcommit_memory setting, but 
> that's not the same as disabling swap altogether. I'd like to try 
> doing this in the future, but I prefer to learn from someone else's 
> mistakes first.

... though I gotta say, we've seen some really ... really wild death 
spirals of skb allocation failures on devices/networks/... trying to 
reach the resource that is to be used for swap.

These days, we recommend using a modern kernel with zswap, and enough RAM 
that you avoid swapping at all costs.  Avoiding swap by design is, by the 
way, really the right approach.  It's not always possible, but swap is 
anywhere from 3 to 6 orders of magnitude slower than core memory.  Once 
you consider the bandwidth issues of using swap, the massive runtime 
extension it causes, and its overall impact upon your 
scheduler/throughput/etc. ... you probably shouldn't use swap.
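
To put rough numbers on the runtime extension, here is a back-of-envelope 
sketch in Python.  It assumes a purely memory-bound job where every access 
costs either one "RAM unit" or a fixed multiple of that when it has to come 
from swap; the function name and parameters are made up for illustration, 
and real workloads (caching, prefetch, page granularity) will behave 
differently.

```python
def slowdown(swap_fraction, swap_penalty):
    """Average access cost relative to running entirely from RAM.

    swap_fraction: share of memory accesses that have to be served
                   from swap (0.0 .. 1.0) -- an assumed input, not
                   something measured here
    swap_penalty:  how many times slower a swap access is than a RAM
                   access (e.g. 10**3 .. 10**6)
    """
    if not 0.0 <= swap_fraction <= 1.0:
        raise ValueError("swap_fraction must be between 0 and 1")
    # Weighted average: RAM accesses cost 1, swap accesses cost swap_penalty.
    return (1.0 - swap_fraction) + swap_fraction * swap_penalty


if __name__ == "__main__":
    # Sweep the 3-to-6 orders of magnitude range against a few
    # hypothetical swap-hit fractions.
    for exponent in (3, 4, 5, 6):
        penalty = 10 ** exponent
        for fraction in (1e-6, 1e-4, 1e-2):
            factor = slowdown(fraction, penalty)
            print(f"penalty 1e{exponent}, fraction {fraction:.0e}: "
                  f"{factor:.2f}x slower")
```

Under this simple model, even one access in 10,000 going to swap at a 
6-orders-of-magnitude penalty makes the job roughly 100x slower.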

Ok, all that said ...

What we are looking at is using the flash on a DIMM for swap.  Build a 
pair of DIMMs into a small (100GB or so) RAID1 type swap space. This 
will take some driver work.  Put these on the main MB.  Now swap isn't 
nearly as (time) expensive, though it is (in acquisition terms) more 
costly.  You could do the same thing with PCIe flash, but it uses a PCIe 
slot (a very precious resource in a compute node), and the cards aren't 
cheap.

Not to mention that I don't like the concept of using the IO channel for 
memory expansion.  I can't be the only one who remembers the EMM/XMM/XMS 
days of PCs ...  It sorta kinda worked, but its failure modes were ... 
well ... spectacular.

