[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
Rahul Nabar rpnabar at gmail.com
Tue Aug 31 07:51:20 PDT 2010
- Previous message: [Beowulf] typical protocol for cleanup of /tmp: on reboot? cron job? tmpfs?
- Next message: [Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
My scheduler, Torque, flags compute nodes as "busy" when the load rises above a threshold called the "ideal load". On my 8-core compute nodes ideal_load is set to 8, but I am wondering whether this is appropriate:

    $max_load 9.0
    $ideal_load 8.0

I do understand the "ideal load = # of cores" heuristic, but in at least 30% of our jobs (if not more) I find the load average greater than 8, sometimes even in the 9-10 range. Does this mean there is something wrong, or do I take this to be the "happy" scenario for HPC, i.e. not only are all CPUs busy but the pipeline of processes waiting for their CPU slice is also relatively full? After all, an "under-loaded" HPC node is a waste of an expensive resource.

On the other hand, if there truly were something wrong with a node[*] and I were to use a high load average as one of the signs of impending trouble, what would be a good threshold? Above what load average on a compute node do people actually get worried? It would make sense to set PBS's default "busy" warning to that limit instead of just "8".

I'm ignoring the 1/5/15-minute load average distinction. I'm assuming Torque is using the most appropriate one!

[*] e.g. runaway process, infinite loop in user code, multiple jobs accidentally assigned to some node, etc.

--
Rahul
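
One way to frame the threshold question in the message above is to express the load average per core and compare it against two multipliers. The sketch below is only an illustration of that idea: the names classify_load and check_node and the factors 1.25 and 2.0 are hypothetical placeholders, not Torque defaults and not recommended values, and the choice of the 5-minute average is likewise an assumption.

```python
import os


def classify_load(load, cores, warn_factor=1.25, alarm_factor=2.0):
    """Classify a load average relative to the number of cores.

    warn_factor and alarm_factor are illustrative assumptions;
    tune them to whatever your site considers abnormal.
    """
    per_core = load / cores
    if per_core >= alarm_factor:
        return "alarm"
    if per_core >= warn_factor:
        return "warn"
    return "ok"


def check_node():
    """Classify this node using its 5-minute load average (Unix only)."""
    _one_min, five_min, _fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return classify_load(five_min, cores)
```

With these placeholder factors, an 8-core node at load 9.5 is still classified "ok", load 10 becomes "warn", and load 16 becomes "alarm". Calling check_node() on a compute node returns one of the same three strings for the live system.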
