[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.

Reuti reuti at staff.uni-marburg.de
Wed Sep 1 04:27:44 PDT 2010


On 01.09.2010 at 12:15, Marian Marinov wrote:

> On Wednesday 01 September 2010 11:47:29 Reuti wrote:
>> On 01.09.2010 at 09:34, Christopher Samuel wrote:
>>> 
>>> On 01/09/10 01:58, Reuti wrote:
>>>> With recent kernels, (kernel) processes in D state
>>>> also count as running.
>>> 
>>> I wouldn't say recent, that goes back as far as I can
>>> remember.
>>> 
>>> For instance I've seen RHEL3 (2.4.x - sort of) NFS servers
>>> with load averages in the 80's where they were run with a lot
>>> of nfsd's that were blocked waiting for I/O due to ext3.
>> 
>> My impression was always (as there is a similar load_threshold setting
>> in OGE) that it is meant to limit the number of jobs on a big SMP machine
>> when you oversubscribe intentionally, as not all parallel jobs really use
>> all the CPU power over their lifetime (maybe such a machine was even
>> operated w/o any NFS). Allowing e.g. 72 slots for jobs on a 60-core
>> machine might then get the most out of it, with a load near 100%.
>> 
>> Well, now that newer CPUs have 12 cores and are assembled into 24- or
>> 48-core machines, such a setting would be useful again. Maybe the load
>> sensor should honor only the scheduled jobs' load.
>> 
>> -- Reuti
>> 
>>> cheers!
>>> Chris
> 
> I believe that the load threshold should be set depending on the type of jobs 
> you run on your compute nodes.
> 
> In some cases the load is not linked only to disk/network I/O and CPU; 
> sometimes the jobs do a lot of in-memory changes, which carry more weight

I thought the load is just the number of processes that are eligible to run plus, nowadays, those in D state (uninterruptible sleep). A single serial process w/o threads or forks shouldn't push the load above 1 just by writing a lot to memory.
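
To make that definition concrete, here is a rough Python sketch that takes an instantaneous count of processes in R (runnable) and D (uninterruptible sleep) state from /proc/<pid>/stat on Linux. The function names are made up for illustration. It looks at processes only, not at the individual threads under /proc/<pid>/task, and it is a single snapshot, whereas the kernel's load average is a damped average over time, so the numbers will only roughly track what uptime reports.

```python
import glob


def task_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so look at what follows the LAST ')'.
    fields = stat_line[stat_line.rfind(")") + 1:].split()
    return fields[0] if fields else ""


def count_load_contributors(stat_lines):
    """Count entries in R (runnable) and D (uninterruptible sleep) state."""
    counts = {"R": 0, "D": 0}
    for line in stat_lines:
        state = task_state(line)
        if state in counts:
            counts[state] += 1
    return counts


def read_stat_lines():
    """Read /proc/<pid>/stat for every process currently visible (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            pass  # the process exited between listing and reading
    return lines


if __name__ == "__main__":
    c = count_load_contributors(read_stat_lines())
    print("runnable (R): %d, uninterruptible (D): %d" % (c["R"], c["D"]))
```

A single serial job that only touches memory shows up here as one R entry at most; blocked nfsd processes like the ones Chris mentions would each add to the D count.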

-- Reuti


> than 
> the actual CPU or disk/network I/O. So, for example, a load average of 15 can 
> also be considered normal, as long as the system is still responsive 
> and the jobs' run times don't degrade.
> 
> -- 
> Best regards,
> Marian Marinov
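
For the warning script itself, one way to account for both points (the core count and the type of jobs) is to compare the load per core against a threshold chosen per node type. The Python sketch below is only an illustration under assumptions: the function names are invented, and the 1.2 per-core threshold is a placeholder taken from the 72-slots-on-60-cores example earlier in the thread, not a recommended value.

```python
import os


def normalized_load(load_avg, ncores):
    """Load average divided by the number of cores."""
    return load_avg / ncores


def is_overloaded(load_avg, ncores, per_core_threshold):
    """True if the per-core load exceeds the site-chosen threshold."""
    return normalized_load(load_avg, ncores) > per_core_threshold


if __name__ == "__main__":
    # Placeholder threshold: 72 slots / 60 cores = 1.2. Tune per job mix.
    threshold = 1.2
    ncores = os.cpu_count() or 1
    try:
        load1, load5, load15 = os.getloadavg()
    except OSError:
        raise SystemExit("load average not available on this platform")
    if is_overloaded(load5, ncores, threshold):
        print("WARNING: 5-min load %.2f on %d cores" % (load5, ncores))
    else:
        print("OK: 5-min load %.2f on %d cores" % (load5, ncores))
```

With this check, a load of 15 on a 24-core node stays below the placeholder threshold, while a load of 80 on a 2-CPU NFS server does not. Whether either case deserves a warning still depends on the jobs the node runs.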