[Beowulf] Cluster Metrics? (Upper management view)

Fri Aug 20 10:34:25 PDT 2010

What sort of business management level metrics do people measure on
clusters?  Upper management is asking us to define and provide
some sort of "numbers" which can be used to gauge the success of our
cluster project.

We currently have both SGE and Torque/Moab in use and need to measure
both if possible.

I can think of some simple metrics (well, sort of; the actual technical
definition and measurement may be difficult):

- 90th/95th percentile wait time for jobs in various queues.  Is smaller
better, meaning the jobs don't wait long and users are happy?  Or is
larger better, meaning that we have lots of demand and need more
resources?

- Core-hours of user computation (per queue?), both as raw time and as a
percentage of available time.  Again, which is better from a management
view: higher or lower?

- Availability during scheduled hours (ignoring scheduled maintenance
times).  Common metric, but how do people actually measure/compute
this?  What about down nodes?  Is some fixed percentage (5%?) assumed
down?

- Number of new science projects performed.  Vague, but our
applications support people can just count things occasionally.
Misses users who just use the system without interaction with us.
Misses "production" work that just keeps running.
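
To make the first three concrete, here is a rough sketch of how they
might be computed.  It assumes the SGE and Torque accounting data have
already been parsed into one common record; the Job record, its field
names, and all function names below are hypothetical, not anything
either scheduler provides.  The percentile uses the nearest-rank
definition, and availability is simply one minus the unplanned-downtime
fraction of scheduled node-hours; both are assumptions, and other
definitions are equally defensible.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical common record built from either scheduler's accounting data."""
    queue: str
    submit: float  # submission time, epoch seconds
    start: float   # start time, epoch seconds
    end: float     # end time, epoch seconds
    cores: int     # cores allocated to the job


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def wait_percentile(jobs, queue, pct):
    """pct-th percentile of queue wait time (seconds) for jobs in one queue."""
    waits = [j.start - j.submit for j in jobs if j.queue == queue]
    return percentile(waits, pct)


def core_hours(jobs):
    """Total core-hours consumed: wall-clock run time times allocated cores."""
    return sum((j.end - j.start) * j.cores for j in jobs) / 3600.0


def utilization(jobs, total_cores, period_hours):
    """Core-hours used as a fraction of core-hours available in the period."""
    return core_hours(jobs) / (total_cores * period_hours)


def availability(scheduled_node_hours, unplanned_down_node_hours):
    """Fraction of scheduled node-hours that were not lost to unplanned downtime."""
    return 1.0 - unplanned_down_node_hours / scheduled_node_hours


# Made-up example data, purely for illustration.
jobs = [
    Job("short", submit=0, start=60, end=3660, cores=4),
    Job("short", submit=0, start=120, end=7320, cores=2),
    Job("short", submit=0, start=600, end=4200, cores=1),
]
print(wait_percentile(jobs, "short", 95))
print(core_hours(jobs))
print(utilization(jobs, total_cores=8, period_hours=24))
print(availability(scheduled_node_hours=1000, unplanned_down_node_hours=50))
```

Counting down nodes in node-hours rather than treating the whole
cluster as up or down is one way to handle the "what about down nodes"
question, but it does not settle whether some percentage should be
budgeted as expected downtime.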

Any comments or ideas are welcome.

Thanks,
Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone