[Beowulf] Cluster Metrics? (Upper management view)

Reuti reuti at staff.uni-marburg.de
Fri Aug 20 12:21:14 PDT 2010


On 20.08.2010 at 19:34, Stuart Barkley wrote:

> What sort of business management level metrics do people measure on
> clusters?  Upper management is asking for us to define and provide
> some sort of "numbers" which can be used to gauge the success of our
> cluster project.
> We currently have both SGE and Torque/Moab in use and need to measure
> both if possible.
> I can think of some simple metrics (well sort-of, actual technical
> definition/measurement may be difficult):
> - 90/95th percentile wait time for jobs in various queues.  Is smaller
> better meaning the jobs don't wait long and users are happy?  Is
> larger better meaning that we have lots of demand and need more
> resources?
> - core-hours of user computation (per queue?) both as raw time and
> percentage of available time.  Again, which is better (management
> view) higher or lower?
> - Availability during scheduled hours (ignoring scheduled maintenance
> times).  Common metric, but how do people actually measure/compute
> this?  What about down nodes?  Some scheduled percentage (5%?) assumed
> down?
> - Number of new science projects performed.  Vague, but our
> applications support people can just count things occasionally.

I use the -A option in SGE (it's also in Torque) to fill the account field with the type of application used for the job. For SGE it's just a comment and is not taken into account for any share tree policy. This field is also recorded in the accounting file. For some types of jobs we even record the submission command that was used in the job's context (`qsub -ac ...`).
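
A minimal sketch of how the account field could be tallied afterwards, in Python. It assumes the colon-separated record layout described in SGE's accounting(5) man page (owner in field 4, account in field 7, ru_wallclock in field 14, slots in field 35, counted from 1); please check these positions against your own version. The function name is only illustrative.

```python
from collections import defaultdict

# Zero-based field positions, assuming the layout from accounting(5).
# These are assumptions; verify them against your SGE version.
ACCOUNT, RU_WALLCLOCK, SLOTS = 6, 13, 34


def core_hours_by_account(lines):
    """Sum wallclock * slots (in hours) per -A account string."""
    totals = defaultdict(float)
    for line in lines:
        # Skip blank lines and comment lines at the top of the file.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) <= SLOTS:
            continue  # truncated or unexpected record
        seconds = float(fields[RU_WALLCLOCK]) * int(fields[SLOTS])
        totals[fields[ACCOUNT]] += seconds / 3600.0
    return dict(totals)
```

It could be fed the open accounting file (usually `$SGE_ROOT/$SGE_CELL/common/accounting`) line by line. Note that this counts reserved slot time, not consumed CPU time.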

> Misses users who just use the system without interaction with us.

With a JSV (job submission verifier) which fills -A automatically, maybe you can find the people who are not interacting with you.
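
This is not a JSV, but a rough approximation from the accounting file could look like the sketch below: list the owners whose jobs carry only the default account string. It assumes that SGE records "sge" when -A is not given and that owner and account sit in fields 4 and 7 of each record, as in accounting(5); both are assumptions to verify locally.

```python
# Assumption: the account string SGE records when -A is not given.
DEFAULT_ACCOUNT = "sge"
# Zero-based field positions, assuming the layout from accounting(5).
OWNER, ACCOUNT = 3, 6


def owners_without_account(lines, default=DEFAULT_ACCOUNT):
    """Return the sorted owners of jobs submitted with the default account."""
    owners = set()
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) > ACCOUNT and fields[ACCOUNT] == default:
            owners.add(fields[OWNER])
    return sorted(owners)
```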

> Misses "production" work that just keeps running.


As already mentioned, measuring success is not so straightforward. You can have 75% CPU load because:

- your parallel jobs are not really scaling with the number of slots used

(for some types of parallel applications it can be possible to run additional serial jobs on these nodes with a nice value of 19, just to gather the otherwise wasted CPU cycles; this works when these "background" jobs can tolerate running slower because of the nice value)

or 75% slot load:

- you request resources like memory or, for parallel jobs, slots; these resources might become reserved, and when there is no small job available for backfilling, they just sit idle

(it can be possible to run additional serial jobs in a queue which gets suspended when the main queue is actually used; this works when these "background" jobs are happy with the non-reserved resources)
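
To make the difference between the two numbers concrete, here is a small sketch in Python. The job tuples and the function name are hypothetical; it only shows the arithmetic, with slot load taken as granted slot-seconds over capacity and CPU load as consumed CPU-seconds over the same capacity.

```python
def utilization(jobs, total_slots, period_s):
    """Return (slot_load, cpu_load) for a reporting period.

    jobs: iterable of (slots, wallclock_s, cpu_s) tuples.
    """
    jobs = list(jobs)
    capacity = total_slots * period_s
    # Slot-seconds granted to jobs, whether or not the CPUs were busy.
    slot_seconds = sum(slots * wall for slots, wall, _ in jobs)
    # CPU-seconds the jobs actually consumed.
    cpu_seconds = sum(cpu for _, _, cpu in jobs)
    return slot_seconds / capacity, cpu_seconds / capacity


# A poorly scaling 8-slot job on an 8-slot cluster for one hour:
# all slots are taken, but only 6 slot-hours of CPU are used.
print(utilization([(8, 3600, 21600)], total_slots=8, period_s=3600))
# A perfectly scaling job that gets only 6 of the 8 slots:
print(utilization([(6, 3600, 21600)], total_slots=8, period_s=3600))
```

Both cases show 75% CPU load, but only the second one also shows 75% slot load.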

-- Reuti

> Any comments or ideas are welcome.
> Thanks,
> Stuart Barkley
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone