[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.

Tue Aug 31 07:51:20 PDT 2010

My scheduler, Torque, flags compute nodes as "busy" when the load gets
above a threshold "ideal load". On my 8-core compute nodes this
ideal_load is set to 8, but I am wondering whether that is
appropriate or not.

$max_load 9.0
$ideal_load 8.0

I do understand the "ideal load = # of cores" heuristic, but in at least
30% of our jobs (if not more) I find the load average greater than
8, sometimes even in the 9-10 range. Does this mean there is
something wrong, or do I take this to be the "happy" scenario for HPC,
i.e. not only are all CPUs busy but the pipeline of processes waiting
for their CPU slice is also relatively full? After all, an
"under-loaded" HPC node is a waste of an expensive resource.

On the other hand, if there truly were something wrong with a node[*]
and I were to use a high load average as one of the signs of
impending trouble, what would be a good threshold? Above what
load average on a compute node do people actually get worried? It
would make sense to set PBS's default "busy" warning to that limit
instead of just "8".
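
To make the question concrete, here is a minimal sketch of the kind of
warning check I have in mind. It assumes a Linux node where
/proc/loadavg is readable. The function names and the 1.25x / 2.0x
per-core ratios are placeholders I made up for illustration, not
recommended values and not anything Torque itself uses.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def classify_load(load, cores, warn_ratio=1.25, alarm_ratio=2.0):
    """Classify a load average relative to the core count.

    warn_ratio and alarm_ratio are hypothetical thresholds expressed as
    load per core; tune them to whatever the site considers "worrying".
    """
    ratio = load / cores
    if ratio >= alarm_ratio:
        return "alarm"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


def check_node(path="/proc/loadavg"):
    """Read the node's load averages and classify the 5-minute value."""
    with open(path) as f:
        one, five, fifteen = parse_loadavg(f.read())
    cores = os.cpu_count() or 1
    return five, cores, classify_load(five, cores)
```

Calling check_node() on a compute node returns the 5-minute load, the
core count and a status string. With the placeholder ratios, an 8-core
node at load 9.5 would still count as "ok", while load 10 would start
to warn.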

I'm ignoring the 1/5/15 min load average distinction. I'm assuming
Torque is using the most appropriate one!

[*] e.g. a runaway process, an infinite loop in user code, multiple jobs
accidentally assigned to the same node, etc.

-- 
Rahul