[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
reuti at Staff.Uni-Marburg.DE
Tue Aug 31 08:58:37 PDT 2010
On 31.08.2010 at 16:51, Rahul Nabar wrote:
> My scheduler, Torque, flags compute-nodes as "busy" when the load gets
> above a threshold "ideal load". My settings on 8-core compute nodes
> have this ideal_load set to 8 but I am wondering if this is
> appropriate or not?
> $max_load 9.0
> $ideal_load 8.0
> I do understand the "ideal load = # of cores" heuristic but in at least
> 30% of our jobs ( if not more ) I find the load average greater than
> 8. Sometimes even in the 9-10 range. But does this mean there is
> something wrong or do I take this to be the "happy" scenario for HPC:
> i.e. not only are all CPUs busy but the pipeline of processes waiting
> for their CPU slice is also relatively full. After all, an
> "under-loaded" HPC node is a waste of an expensive resource?
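With recent kernels, processes in the D state (uninterruptible sleep, typically waiting on I/O), including kernel threads, are also counted as if they were running. Hence the load average can be higher than the number of runnable processes alone would suggest, and a value somewhat above 8 on an 8-core node does not by itself mean the CPUs are oversubscribed.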
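To see how much of the load on a node comes from D-state tasks, the load average can be compared with a count of tasks per state. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names (parse_loadavg, state_of, count_states, read_stat_lines, report) are made up for this illustration and are not part of Torque or any other tool.

```python
import glob


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def state_of(stat_line):
    """Extract the one-letter task state from the content of a stat file."""
    # The command name is enclosed in parentheses and may itself contain
    # spaces or ')', so take the first field after the LAST ')'.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def count_states(stat_lines):
    """Count tasks in state R (running/runnable) and D (uninterruptible sleep)."""
    counts = {"R": 0, "D": 0}
    for line in stat_lines:
        state = state_of(line)
        if state in counts:
            counts[state] += 1
    return counts


def read_stat_lines():
    """Read the stat file of every task (thread) currently visible in /proc."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            continue  # the task exited between listing and reading
    return lines


def report():
    """Print the load averages next to a snapshot of R and D task counts."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = parse_loadavg(f.read())
    counts = count_states(read_stat_lines())
    print(f"load average (1/5/15 min): {one} {five} {fifteen}")
    print(f"tasks in R state: {counts['R']}, in D state: {counts['D']}")
```

Calling report() on a compute node prints one snapshot. The counts are instantaneous while the load average is a decaying average over time, so the numbers will only agree roughly, and the script itself shows up as one task in R state.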
> On the other hand, if there truly were something wrong with a node[*]
> and I were to use a high load average as one of the signs of
> impending trouble what would be a good threshold? Above what
> load-average on a compute node do people get actually worried? It
> makes sense to set PBS's default "busy" warning to that limit instead
> of just "8".
> I'm ignoring the 1/5/15 min load average distinction. I'm assuming
> Torque is using the most appropriate one!
> [*] e.g. runaway process, infinite loop in user code, multiple jobs
> accidentally assigned to some node etc.
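For a warning script, the two settings quoted above can be treated as a pair of thresholds with hysteresis. The sketch below assumes that a node is flagged busy once the load exceeds $max_load and that the flag is cleared only when the load drops below $ideal_load, which is how the pbs_mom documentation describes these parameters as far as can be told; please check the documentation of your Torque version. The function update_busy is hypothetical, and the defaults 8.0 and 9.0 are simply the values from the original post, not a recommendation.

```python
def update_busy(load, busy, ideal_load=8.0, max_load=9.0):
    """Return the new busy flag for one load sample (two-threshold hysteresis).

    Assumed semantics: set busy above max_load, clear it below ideal_load,
    and keep the previous state for anything in between.
    """
    if load > max_load:
        return True
    if load < ideal_load:
        return False
    return busy


def busy_history(samples, ideal_load=8.0, max_load=9.0):
    """Apply update_busy to a sequence of load samples, starting as not busy."""
    busy = False
    history = []
    for load in samples:
        busy = update_busy(load, busy, ideal_load, max_load)
        history.append(busy)
    return history
```

With the gap between the two values, a node whose load hovers between 8 and 9 keeps its previous state instead of flipping on every sample. Widening that gap, or raising both values to leave room for D-state tasks, is a site-specific choice.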