[Beowulf] Cluster Metrics? (Upper management view)

Fri Aug 20 11:26:51 PDT 2010

I think measuring a cluster's success based on the number of jobs run
or CPUs used is a bad measure of true success.  I would be more
inclined to judge a cluster a success by speaking with the people
who use it and finding out not only whether they can use it effectively
but also what new science having the cluster enables for them.

The only thing I find most of the metrics below really useful for is
figuring out whether or not we need a bigger cluster.  I guess that
is a form of measurable success, but not one by which I would consider
the "cluster" to be a success.  It could just be dopes running
thousands of "/bin/hostname" jobs trying to figure out how to use the
cluster.

I also think you need to ask the "business" people by what measure they
would consider a cluster a worthwhile investment; it doesn't sound
from your email as if you have that.
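
If you do end up reporting the wait-time and core-hour numbers quoted
below, here is a rough sketch of how they might be computed.  It
assumes the accounting records from SGE or Torque/Moab have already
been reduced to submit, start, and end times (epoch seconds) plus a
core count per job.  The Job record, the function names, and the
min_runtime cutoff are all hypothetical; neither scheduler provides
them.  The cutoff is only there so that thousands of trivial
"/bin/hostname" runs don't inflate the totals.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    """One finished job; times are epoch seconds (assumed input format)."""
    submit: float
    start: float
    end: float
    cores: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def wait_percentile(jobs, pct):
    """Queue wait time (start - submit), in seconds, at the given percentile."""
    return percentile([j.start - j.submit for j in jobs], pct)


def core_hours(jobs, min_runtime=0.0):
    """Total core-hours, skipping jobs that ran for less than min_runtime seconds."""
    return sum(
        j.cores * (j.end - j.start) / 3600
        for j in jobs
        if j.end - j.start >= min_runtime
    )


def utilization(jobs, total_cores, period_hours, min_runtime=0.0):
    """Core-hours used as a fraction of core-hours available in the period."""
    return core_hours(jobs, min_runtime) / (total_cores * period_hours)


if __name__ == "__main__":
    jobs = [
        Job(submit=0, start=60, end=3660, cores=4),
        Job(submit=0, start=120, end=125, cores=1),   # trivial 5-second job
        Job(submit=0, start=600, end=7800, cores=8),
    ]
    print("95th percentile wait (s):", wait_percentile(jobs, 95))
    print("core-hours (jobs >= 60 s):", core_hours(jobs, min_runtime=60))
    print("utilization:", utilization(jobs, total_cores=16, period_hours=24,
                                      min_runtime=60))
```

None of this says whether a high or low number is "good"; it only
makes the numbers reproducible, so the conversation with the users and
the business people can be about what they mean.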

On Fri, Aug 20, 2010 at 1:34 PM, Stuart Barkley <stuartb at 4gh.net> wrote:
> What sort of business management level metrics do people measure on
> clusters?  Upper management is asking for us to define and provide
> some sort of "numbers" which can be used to gauge the success of our
> cluster project.
>
> We currently have both SGE and Torque/Moab in use and need to measure
> both if possible.
>
> I can think of some simple metrics (well sort-of, actual technical
> definition/measurement may be difficult):
>
> - 90/95th percentile wait time for jobs in various queues.  Is smaller
> better meaning the jobs don't wait long and users are happy?  Is
> larger better meaning that we have lots of demand and need more
> resources?
>
> - core-hours of user computation (per queue?) both as raw time and
> percentage of available time.  Again, which is better (management
> view) higher or lower?
>
> - Availability during scheduled hours (ignoring scheduled maintenance
> times).  Common metric, but how do people actually measure/compute
> this?  What about down nodes?  Some scheduled percentage (5%?) assumed
> down?
>
> - Number of new science projects performed.  Vague, but our
> applications support people can just count things occasionally.
> Misses users who just use the system without interaction with us.
> Misses "production" work that just keeps running.
>
> Any comments or ideas are welcome.
>
> Thanks,
> Stuart Barkley
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone