[Beowulf] Interactive vs batch, and schedulers

Fri Jan 17 09:52:47 PST 2020

On Thu, 16 Jan 2020 23:24:56 "Lux, Jim (US 337K)" wrote:
> What I’m interested in is the idea of jobs that, if spread across many
> nodes (dozens) can complete in seconds (<1 minute) providing
> essentially “interactive” access, in the context of large jobs taking
> days to complete.   It’s not clear to me that the current schedulers
> can actually do this – rather, they allocate M of N nodes to a
> particular job pulled out of a series of queues, and that job “owns”
> the nodes until it completes.  Smaller jobs get run on (M-1) of the N
> nodes, and presumably complete faster, so it works down through the
> queue quicker, but ultimately, if you have a job that would take, say,
> 10 seconds on 1000 nodes, it’s going to take 20 minutes on 10 nodes.

Generalizations are prone to failure but here we go anyway...

If there is enough capacity and enough demand for both classes of jobs 
one could set up queues for the specific types, to keep the big ones and 
the small ones apart, with pretty much constant utilization.

In some instances it may be possible to define the benefit (in some
unit, let's say dollars) of obtaining a given job's results, and also
to define the costs (in the same units) of node-hours, wait time, and
other resources.  Using that function it might be possible to schedule
the job mix to maximize "value", at least approximately.  Based solely
on times and nodes, without some measure of benefit and cost, it might
be possible to optimize node utilization (by some measure), but spinning
the CPUs isn't really the point of the resource, right?  I expect that
whatever job mix maximizes value will also maximize utilization, but
not necessarily the other way around.  I bet that AWS's scheduler uses
some sort of value calculation like that.
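
To make the "value" idea concrete, here is a minimal sketch in Python.
Everything in it is an assumption made for illustration: the Job fields,
the NODE_HOUR_COST figure, and the greedy ranking by net value per
node-hour are hypothetical and are not taken from any real scheduler.
Wait-time cost is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # nodes requested
    hours: float    # estimated run time in hours
    benefit: float  # hypothetical dollar value of having the results


NODE_HOUR_COST = 0.50  # hypothetical cost in dollars per node-hour


def net_value(job):
    """Benefit minus the cost of the node-hours the job consumes."""
    return job.benefit - job.nodes * job.hours * NODE_HOUR_COST


def pick_jobs(jobs, free_nodes):
    """Greedy, approximate choice: start the jobs with the highest net
    value per node-hour that still fit in the free nodes."""
    ranked = sorted(
        jobs,
        key=lambda j: net_value(j) / (j.nodes * j.hours),
        reverse=True,
    )
    chosen = []
    for job in ranked:
        # Skip jobs that cost more than they are worth or do not fit.
        if net_value(job) > 0 and job.nodes <= free_nodes:
            chosen.append(job)
            free_nodes -= job.nodes
    return chosen


if __name__ == "__main__":
    queue = [
        Job("big", nodes=60, hours=48, benefit=5000),
        Job("quick", nodes=40, hours=0.01, benefit=20),
        Job("low", nodes=50, hours=10, benefit=100),
    ]
    for job in pick_jobs(queue, free_nodes=100):
        print(job.name, round(net_value(job), 2))
```

A greedy pass like this only approximates the best mix, and the hard
part in practice is getting anybody to agree on the benefit numbers.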

A somewhat related problem occurs when there are slow jobs which use a
lot of memory but cannot benefit from all the CPUs on a node.  (I.e.,
they scale poorly.)  Better utilization is possible if CPU-efficient,
low-memory jobs can be run at the same time on those nodes, using the
"spare" CPUs.  Done just right this is win-win, with both jobs running
at close to their optimal speeds.  This is tricky, though: if the total
memory usage cannot be calculated ahead of time to be sure there is
enough, the two jobs can end up fighting over that resource, with run
times going way up when heavy page faulting occurs, or with jobs
crashing when the system runs out of memory.
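
The admission check this implies can be sketched as follows.  It is only
a sketch under assumptions: it supposes each job declares a trustworthy
peak memory figure up front (which, as noted above, is often exactly
what is missing), and the NodeJob class, the can_coschedule function,
and the headroom_gb reserve are hypothetical names invented here.

```python
from dataclasses import dataclass


@dataclass
class NodeJob:
    name: str
    cpus: int           # CPUs the job can actually keep busy
    peak_mem_gb: float  # declared upper bound on memory use (assumed reliable)


def can_coschedule(jobs, node_cpus, node_mem_gb, headroom_gb=4.0):
    """Return True only if the jobs fit on one node together, both in
    CPUs and in declared peak memory, leaving headroom_gb for the OS."""
    total_cpus = sum(j.cpus for j in jobs)
    total_mem = sum(j.peak_mem_gb for j in jobs)
    return total_cpus <= node_cpus and total_mem <= node_mem_gb - headroom_gb


if __name__ == "__main__":
    big_mem = NodeJob("big_mem", cpus=4, peak_mem_gb=200)
    filler = NodeJob("filler", cpus=28, peak_mem_gb=30)
    # 32 CPUs and 230 GB requested on a 32 CPU, 256 GB node.
    print(can_coschedule([big_mem, filler], node_cpus=32, node_mem_gb=256))
```

If the declared peaks are wrong, the check passes and the jobs still
end up fighting over memory, so some enforcement of the declared limit
would be needed as well.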

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech