[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

Skylar Thompson skylar.thompson at gmail.com
Mon Jun 11 06:18:58 PDT 2018


On Mon, Jun 11, 2018 at 02:36:14PM +0200, John Hearns via Beowulf wrote:
> Skylar Thompson wrote:
> >Unfortunately we don't have a mechanism to limit
> >network usage or local scratch usage, but the former is becoming less of a
> >problem with faster edge networking, and we have an opt-in bookkeeping
> >mechanism for the latter that isn't enforced but works well enough to keep
> >people happy.
> That is interesting to me. At ASML I worked on setting up Quality of
> Service, i.e. bandwidth limits, for GPFS storage and MPI traffic.
> GPFS does have built-in QoS limits, but these are intended to throttle
> background housekeeping tasks rather than user processes.
> But it does have the concept.
> With MPI you can configure different QoS levels for different traffic.
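
For anyone who wants to poke at the GPFS side of this: the built-in
throttles live behind mmchqos/mmlsqos. A minimal sketch, assuming
Spectrum Scale 4.2 or later and a filesystem named fs1 (names and
numbers are illustrative only):

    # Cap background maintenance work (restripes, rebalances, etc.) at
    # 300 IOPS across all storage pools, leaving user I/O unthrottled:
    mmchqos fs1 --enable pool=*,maintenance=300IOPS,other=unlimited

    # Show consumption per QoS class over the last 60 seconds:
    mmlsqos fs1 --seconds 60

    # Turn throttling back off:
    mmchqos fs1 --disable

And on the MPI side, IIRC the old Open MPI openib BTL exposed the
InfiniBand service level as an MCA parameter (something like
mpirun --mca btl_openib_ib_service_level 2 ...), which the fabric then
maps onto virtual lanes.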
> 
> More relevantly, I did have a close discussion with Parav Pandit, who is
> working on the network QoS stuff.
> I am sure there is something more up to date than this:
> https://www.openfabrics.org/images/eventpresos/2016presentations/115rdmacont.pdf
> Sadly this RDMA stuff needs a recent 4.x-series kernel. I guess the
> discussion on whether or not you should go with a bleeding-edge kernel is
> for another time!
> But yes, cgroups have configurable network limits with the latest kernels.
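
On the cgroup side, what landed in mainline (4.11 or so) is the "rdma"
controller, which limits RDMA resource counts (QPs, MRs, etc., via HCA
handles and objects) rather than bandwidth as such. A rough sketch with
cgroup v2, assuming an mlx5_0 device and the unified hierarchy mounted
at /sys/fs/cgroup (the group name and limits are illustrative):

    # Create a cgroup for a job and enable the rdma controller:
    mkdir /sys/fs/cgroup/job42
    echo "+rdma" > /sys/fs/cgroup/cgroup.subtree_control

    # Cap the job at 10 HCA handles and 100 HCA objects on mlx5_0:
    echo "mlx5_0 hca_handle=10 hca_object=100" \
        > /sys/fs/cgroup/job42/rdma.max

    # Check current usage against the limits:
    cat /sys/fs/cgroup/job42/rdma.current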
> 
> Also, being cheeky, and I probably have mentioned them before, here is a
> plug for Ellexus: https://www.ellexus.com/
> Worth mentioning that I have no connection with them!

Thanks for the pointer to Ellexus - their I/O profiling does look like
something that could be useful for us. Since we're a bioinformatics shop
and mostly storage-bound rather than network-bound, we haven't really
needed to worry about per-node network limitations (though we have
occasionally had to worry about ToR or chassis switch limits), but we
have really suffered at times when people assume that disk performance
is limitless and that random access costs the same as sequential access.
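
The gap is easy to demonstrate, too. A quick fio sketch, assuming fio is
installed and there is about 1 GiB of scratch space free; the directory
and sizes are illustrative only:

    [global]
    ; hypothetical scratch path with ~1 GiB free
    directory=/scratch/iotest
    size=1g
    bs=4k
    direct=1
    ioengine=libaio

    [seq-read]
    rw=read

    [rand-read]
    ; stonewall starts this job only after seq-read finishes
    stonewall
    rw=randread

Save it as iotest.fio and run "fio iotest.fio". On spinning disk the
randread job typically reports a small fraction of the sequential
bandwidth, which is exactly the assumption that bites us.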

-- 
Skylar

