[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
reuti at staff.uni-marburg.de
Wed Sep 1 01:47:29 PDT 2010
On 01.09.2010 at 09:34, Christopher Samuel wrote:
> On 01/09/10 01:58, Reuti wrote:
>> With recent kernels also (kernel) processes in D state
>> count as running.
> I wouldn't say recent, that goes back as far as I can remember.
> For instance I've seen RHEL3 (2.4.x - sort of) NFS servers
> with load averages in the 80's where they were run with a lot
> of nfsd's that were blocked waiting for I/O due to ext3.
My impression has always been (as OGE has a similar setting, load_threshold) that such a threshold is meant to limit the number of jobs on a big SMP machine when you oversubscribe it on purpose, since not all parallel jobs really use all of their CPU power over their whole lifetime (such a machine may even have been operated without any NFS). Allowing e.g. 72 slots for jobs on a 60-core machine might then get the most out of it, with the CPU load near 100%.
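To make the arithmetic concrete for a warning script, here is a minimal sketch in Python. It normalises the 1-minute load average by the core count and compares it to a per-core limit. The function names and the default limit of 1.2 (which is just 72 slots / 60 cores from the example above) are my own assumptions for illustration, not anything OGE defines.

```python
import os


def oversubscription_factor(slots, cores):
    """Slots offered per physical core, e.g. 72 slots on 60 cores -> 1.2."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return slots / cores


def load_per_core(load1, cores):
    """1-minute load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load1 / cores


def is_load_high(load1, cores, limit=1.2):
    """True if the per-core load exceeds the (hypothetical) limit."""
    return load_per_core(load1, cores) > limit


if __name__ == "__main__":
    load1, _, _ = os.getloadavg()   # 1-, 5- and 15-minute averages
    cores = os.cpu_count() or 1     # cpu_count() may return None
    status = "HIGH" if is_load_high(load1, cores) else "ok"
    print(f"load1={load1:.2f} cores={cores} per-core={load1 / cores:.2f} {status}")
```

With such a limit, a load of 80 on the 60-core machine would trigger a warning, while a load of 60 would not. Whether 1.2 is a sensible value depends on how much of the load comes from blocked rather than runnable processes.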
Well, now that newer CPUs have 12 cores and are assembled into 24- or 48-core machines, such a setting would become useful again. Maybe the load sensor should take only the load of the scheduled jobs into account.
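As a rough illustration of what "only the scheduled jobs' load" could mean, the following Python sketch counts how many of a given set of job processes are currently runnable (state R) and how many are blocked in uninterruptible sleep (state D), by reading /proc/<pid>/stat on Linux. How the list of job PIDs is obtained from the scheduler is left open. The function names are hypothetical, and this is not how any existing load sensor works.

```python
from collections import Counter
from pathlib import Path


def task_state(stat_line):
    """Return the state letter (R, S, D, ...) from a /proc/<pid>/stat line."""
    # The command name is enclosed in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def job_load(pids, proc=Path("/proc")):
    """Count (runnable, uninterruptible) processes among the given PIDs.

    Assumption: only the main thread of each process is inspected;
    the threads of a multi-threaded job are not counted separately.
    """
    states = Counter()
    for pid in pids:
        try:
            line = (proc / str(pid) / "stat").read_text()
            states[task_state(line)] += 1
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry is unreadable
    return states["R"], states["D"]
```

A sensor built this way could report the R count alone as the jobs' CPU load, so that nfsd threads or other daemons stuck in D state would not push a node over the threshold.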
> - --
> Christopher Samuel - Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computational Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf