[Beowulf] [OOM killer/scheduler] disabling swap on cluster nodes?
Joe Landman
landman at scalableinformatics.com
Wed Feb 11 07:48:04 PST 2015
On 02/11/2015 12:25 AM, Mark Hahn wrote:
>>> is net-swap really ever a good idea? it always seems like asking for
>>> trouble, though in principle there's no reason why net IO should be
>>> "worse" than disk IO...
>>
>> ... except for the need to allocate memory to build packets to send
>> the swap data.
>
> I thought the implication was clear, that doing disk IO may also require
> memory allocations.
Paging to local scratch is less memory-intensive than constructing
packets in memory to carry swap data over the network. Worse, those
packet allocations are needed at exactly the moment memory is scarce,
which is how network-swap setups get into trouble. Local paging, by
contrast, is really quite memory-efficient.
>
>> There are still a few places that look at you funny if you suggest
>> running w/o swap. The 6 orders of magnitude difference in random
>> page-touch performance suggests you should stare them back down.
>
> absolutely: if you have reason to believe all your pages are uniformly hot,
> more power to you!
Bad analysis. In the old days (ugh), locality of reference was
something you worked very hard at to make effective use of your
memory. You re-ordered your loops, and did all manner of other things.
Nowadays you have to worry about objects and their instance data,
where you don't know nearly as well when and where they will be
touched. Feel free to run a modern OO code on a memory-starved system
... it's just not pleasant. That 6-orders-of-magnitude performance
variance between hot and cold pages (a DRAM touch is ~100 ns; a random
4 KB page fault to a busy, thrashing disk can take tens to hundreds of
milliseconds) will bite you.
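Since the subject line asks: actually going swapless on a node is
trivial. A minimal sketch (assumes GNU sed and a standard fstab
layout; adapt to your config management):

    # turn off all active swap on this node
    swapoff -a
    # keep it off across reboots: comment out swap lines in fstab
    sed -i '/\sswap\s/ s/^/#/' /etc/fstab
    # or, if you keep a small swap device around, make the kernel
    # very reluctant to touch it
    sysctl -w vm.swappiness=1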
>> Seriously, if you can avoid under-spec'ing/provisioning ram, you should.
>
> in other words: buy extra ram to hold your cold pages! after all, dram
> is only O($10/GB), and disk is O($0.05/GB). oh, wait...
And this is what I was waiting for: someone pulling out a bad analysis
and then using it as a strawman.
OK, using your underlying theory here (disk is cheap, RAM is
expensive), let's go to zero RAM and save money.
Oh ... wait ...
Yes, it should be obvious why this is silly. And by extension, the
original argument is silly.
But the more subtle point (which is the one I had hoped you would go
for, as it's the one that makes sense) is that there is a fine balance
between the size of RAM and (if you use it) swap. This balancing act
is influenced by the opportunity cost of the decision (less RAM ->
more swap, longer execution time/cost for memory-intensive codes;
versus more RAM -> less swap, shorter execution time, though higher
cost per node).
In fact this gets to the very definition of opportunity cost: how much
value am I giving up by making the alternative choice? Another way of
thinking about this is to ask what the marginal value of more or less
RAM is.
This is why I argue that sizing memory (and almost everything else) is
so important. Building a 1 TB RAM machine for problems that run in
4 GB is a waste of resources (too much RAM). Building a 16 GB RAM
machine for problems that need 1 TB is a waste (too little).
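To make the marginal-value question concrete, here's a toy
back-of-the-envelope (every number below is an illustrative
assumption, not a measurement):

    # hypothetical node: add 64 GB of RAM at ~$10/GB, or let the job page.
    # assume paging stretches a 10 hour job to 50 hours on a node whose
    # operating + amortized cost is ~$1/hour.
    ram_cost=$((64 * 10))             # $640, one time
    swap_penalty=$(( (50 - 10) * 1 )) # $40 of lost node-time per run
    echo "extra RAM pays for itself after $((ram_cost / swap_penalty)) runs"

With these made-up numbers the RAM pays for itself in 16 runs; your
numbers will differ, but the structure of the calculation doesn't.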
>> wish for the wild west of OOM shooting random things in comparison to
>> random 4k page touches. Yes, I've seen the latter.
>
> thrashing is bad. it's not the same as *using* swap. that's why swap
> still makes sense.
Thrashing *is* using swap as a transparent memory extension. It is one
of the worst possible cases, and seen quite frequently when you have
large OO codes where you can't predict which object is going to do
what. Or you have large in-memory databases. Or ...
That is, swap/paging provides a memory extension, and it's a crutch
relative to in-app memory management. The latter is generally frowned
upon in most development circles these days, especially with GC'd OO
code.
The world has evolved significantly since I spilled my first matrices to
local files.
> interesting thought: SSD is about $0.5/GB, so would make a great swap
> dev - has anyone tried tuning the swap cluster size to match the SSD
> flash block?
We've done quite a bit of this, yes.
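The relevant knob is vm.page-cluster: the kernel moves swap pages in
groups of 2^page-cluster, so the default of 3 means 8 pages (32 KB
with 4 KB pages) per cluster. To line that up with, say, a 128 KB
flash block (the 128 KB figure is an assumption; check your device),
something like:

    # 2^5 = 32 pages x 4 KB = 128 KB per swap cluster
    sysctl -w vm.page-cluster=5
    # persist across reboots
    echo 'vm.page-cluster = 5' >> /etc/sysctl.conf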
What it comes down to is: a) swap is a terrible thing to do, so avoid
it if possible; b) if you can't avoid it, do it as quickly as you can;
and c) the incremental cost of increasing RAM size versus paying the
(often far) longer run time (with all its attendant costs and effects:
slower throughput, fewer jobs per unit time, more power spent per job,
etc.) is heavily biased *against* building sizable swap. This is why
we use zram and zcache whenever possible, along with very fast, tuned
swap partitions.
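For the zram piece, a minimal sketch (the 4 GB size and the priority
value are assumptions; size it to your nodes):

    modprobe zram
    echo 4G > /sys/block/zram0/disksize   # compressed swap held in RAM
    mkswap /dev/zram0
    swapon -p 100 /dev/zram0              # higher priority than disk swap

Pages then compress into RAM first, and only spill to the (much
slower) disk swap devices once zram fills.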
Note though, and this has happened to us before: if a swap device dies
while you have pages out on it ... let's just say that's a new
experience in crashing. It's exactly like pulling a random DIMM out of
a running machine.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: landman at scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615