[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
Reuti
reuti at staff.uni-marburg.de
Wed Sep 1 04:27:44 PDT 2010
On 01.09.2010 at 12:15, Marian Marinov wrote:
> On Wednesday 01 September 2010 11:47:29 Reuti wrote:
>> On 01.09.2010 at 09:34, Christopher Samuel wrote:
>>>
>>> On 01/09/10 01:58, Reuti wrote:
>>>> With recent kernels, (kernel) processes in D state also
>>>> count as running.
>>>
>>> I wouldn't say recent, that goes back as far as I can
>>> remember.
>>>
>>> For instance, I've seen RHEL3 (2.4.x - sort of) NFS servers
>>> with load averages in the 80s where they were run with a lot
>>> of nfsd's that were blocked waiting for I/O due to ext3.
>>
>> My impression was always (as there is a similar setting for the
>> load_threshold in OGE) that it should limit the number of jobs on a big
>> SMP machine when you intentionally oversubscribe it, as not all parallel
>> jobs really use all the CPU power over their lifetime (maybe such a
>> machine was even operated w/o any NFS). Then allowing e.g. 72 slots for
>> jobs on a 60-core machine might get the most out of it, with a load near
>> 100%.
>>
>> Well, now that newer CPUs have 12 cores and are assembled into 24- or
>> 48-core machines, such a setting would be useful again. Maybe the load
>> sensor should honor only the scheduled jobs' load.
>>
>> -- Reuti
>>
>>> cheers!
>>> Chris
>
> I believe that the load threshold should be set depending on the type of jobs
> you run on your compute nodes.
>
> In some cases the load is not linked only to disk/network I/O and CPU;
> sometimes the jobs do a lot of in-memory changes, which carry more weight
I thought the load is just the number of processes that are eligible to run plus, nowadays, those in D state. A single serial process w/o threads or forks shouldn't push the load above 1 just by writing a lot to memory.
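
To check this on a node, something along these lines could be used. It is only a rough sketch, assuming a Linux /proc filesystem and Python on the node. The helper names (parse_state, count_states) are made up for illustration, and it looks only at each process's main thread, so it will undercount multi-threaded jobs compared to what the kernel sees.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split at the LAST ')' and take the next field.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc="/proc"):
    """Count processes per state (R = runnable, D = uninterruptible sleep, ...)."""
    counts = Counter()
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited while we were scanning
    return counts


if __name__ == "__main__":
    states = count_states()
    load1, _, _ = os.getloadavg()   # 1-minute load average
    cores = os.cpu_count() or 1
    print("runnable (R):", states["R"])
    print("uninterruptible (D):", states["D"])
    print("1-min load: %.2f, per core: %.2f" % (load1, load1 / cores))
```

The load average is a decaying average over time, so the instantaneous R + D count will not match it exactly. But on an NFS server with many blocked nfsd's, the D count should account for most of the load.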
-- Reuti
> than
> the actual CPU or disk/network I/O. So, for example, a load average of 15 can
> also be considered normal load, as long as the system is still responsive
> and the job run times don't degrade.
>
> --
> Best regards,
> Marian Marinov