[Beowulf] Should I go for diskless or not?
larry.stewart at sicortex.com
Fri May 15 12:49:43 PDT 2009
On May 15, 2009, at 1:12 PM, Ashley Pittman wrote:
> On Fri, 2009-05-15 at 06:43 -0400, Lawrence Stewart wrote:
>> I'll echo the remarks about swapping, there is a large patch set for
>> swapping over IP, and we don't run that. In fact right now we run
>> without swap space, and vm.overcommit_ratio set to "90". This is
>> generous enough that we're not having problems, even on large
>> with running out of memory. Everyone seems to agree that having some
>> swap space is good for stability, so we do plan to add swap at some
>> point. We've got a new network block device that can swap over the
>> interconnect (without any allocations) at about 2 GB/s which is
>> good enough to make DSM interesting. If you have local disks, using
>> them for swap will work fine.
> Another problem which nobody's mentioned yet is where are you going to
> swap to? Sure each node might have 2GB/s network bandwidth to play
> with but no frontend is going to cope with more than a handful of nodes
> swapping at once. It might be viable for a network of diskless
> workstations but for a cluster forget it.
> The only way that network swapping can make sense in a cluster is if
> you know the application doesn't fit in memory and can allocate some extra
> nodes to host the swapped memory, preferably swapping over the network
> to RAM on a remote machine. This doubles the nodes required to run
> the job however and makes scheduling it with normal jobs impossible.
> Ashley Pittman,
Ashley's comments are right on. There are really two reasons for swap:
* low performance swap, just to let the real application run closer
to 100% of memory. If the app is swapping, performance will be awful,
but just having some swap will let the OS move other junk out of the
way. Also, having modest swap will keep the OOM killer at bay.
* "buddy" swapping, to let you page to other nodes' memory. If
paging is pretty fast to such a ramdisk, then you may be able to run
larger or more interesting applications. Our nodes don't have unused
DIMM sockets, so you can't just plug in more. We do a related thing
also, by having a Lustre parallel filesystem backed by RAM on some
allocated nodes. This is surprisingly useful for scratch space and temp
files, and it is a lot faster than any affordable disk subsystem.
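
As a rough illustration of the first point, here is a small sketch of how
the kernel's commit limit moves when a modest amount of swap is added. It
assumes strict overcommit accounting (vm.overcommit_memory = 2), which is
the mode in which the overcommit ratio applies; the thread does not say
which mode is in use. The 8 GiB node size, the 1 GiB swap size, and the
function name are hypothetical and chosen only for the arithmetic.

```python
# Sketch of the kernel's commit limit under strict overcommit accounting
# (assumed: vm.overcommit_memory = 2). All sizes are in KiB.

def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio, hugetlb_kib=0):
    """Return swap + (RAM - hugetlb pages) * overcommit_ratio / 100."""
    return swap_kib + (ram_kib - hugetlb_kib) * overcommit_ratio // 100

GIB = 1024 * 1024  # KiB per GiB

# Hypothetical node: 8 GiB of RAM (not a figure from this thread).
ram = 8 * GIB

no_swap = commit_limit_kib(ram, 0, 90)            # ratio 90, no swap
modest_swap = commit_limit_kib(ram, 1 * GIB, 90)  # ratio 90, 1 GiB of swap

print(f"no swap:    {no_swap / ram:.1%} of RAM committable")
print(f"1 GiB swap: {modest_swap / ram:.1%} of RAM committable")
```

With these made-up numbers the limit sits at about 90% of RAM without
swap and rises slightly above 100% once 1 GiB of swap exists, which is the
"closer to 100% of memory" effect described in the first bullet.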