Reliability analysis (was RE: Windows HPC (@ Cornell))

Wed Nov 6 16:05:16 PST 2002

>
>
>** Reliability. We ran a 256-processor Dell cluster with Windows 2000
>and collected all errors (OS, I/O, hardware) on a secure web site for 6
>months. MIT analyzed and independently verified the up-time--99.9986%.

Couldn't find the real email to respond to here, but the above excerpt 
captures it...

Has anyone running a cluster done a real reliability analysis and 
published the data and analysis?  It need not be peer reviewed; even a 
good web page description would do.

For instance, Paul at Cornell has claimed 99.9986% uptime (close to "five 
nines"), but hasn't provided any numerical backup for the assertion, 
except to say that MIT analyzed some unspecified set of data.  Is Cornell 
going to publish the data and details of the analysis?  For instance, what 
was the reliability model used? What statistical failure distribution was 
assumed (exponential, Weibull)? How are "failure" and "up time" defined? I 
searched the entire Cornell site using their search engine, and all I found 
was a couple of marketing-speak presentations that didn't provide any 
numerical backup for the assertions.
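
To show why the definition of "up time" matters, here is a minimal Python 
sketch that converts the quoted 99.9986% figure into implied downtime. The 
function name, the 182.5-day reading of "6 months", and the choice to count 
either the whole cluster as one unit or each of the 256 processors 
separately are my own assumptions; the excerpt does not say which was meant.

```python
HOURS_PER_DAY = 24.0


def implied_downtime_hours(availability_percent, period_days, units=1):
    """Downtime, in unit-hours, implied by an availability percentage.

    units=1 treats the whole cluster as a single unit; units=N counts
    each of N processors (or nodes) separately.
    """
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * period_days * HOURS_PER_DAY * units


if __name__ == "__main__":
    period_days = 182.5  # assumption: "6 months" taken as half a year

    # Reading 1: the cluster as a whole was down this long.
    whole = implied_downtime_hours(99.9986, period_days)
    print(f"cluster as one unit: {whole * 60:.1f} minutes down")

    # Reading 2: downtime summed over 256 processors.
    per_cpu = implied_downtime_hours(99.9986, period_days, units=256)
    print(f"summed over 256 processors: {per_cpu:.1f} processor-hours down")
```

Under the first reading the claim allows under four minutes of outage in 
six months; under the second it allows roughly sixteen processor-hours. 
Those are very different statements about the same percentage.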

I think this would be a very useful thing as a point of discussion for 
clusters in general, since the terms "high availability", "high 
reliability", "MTBF" and so forth are bandied about pretty freely, without 
any unambiguous definitions.  There's a lot of literature and discussion on 
performance and how to fairly evaluate it in terms of BogoMIPS, GFLOPS, 
bisection bandwidth, etc., but not nearly as much on other aspects of 
running a cluster.

I would think that a reasonably rigorous analysis would need to address 
things like (re)boot time, mean time to repair, the difference between 
"operating system up and ready" and "actually running user code", and so 
forth. Maybe a good start would be to establish a common terminology for 
things, and then we can argue/discuss how to boil down 
measurements/predictions to a single "figure of merit".
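
As a starting point for that discussion, here is a rough Python sketch of 
one possible figure of merit: steady-state availability, MTBF / (MTBF + 
MTTR), where the "repair" time can optionally include reboot and job 
restart. All the numbers are made up for illustration, the function names 
are hypothetical, and the all-nodes-up calculation assumes nodes fail 
independently, which a real cluster may well violate.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: long-run fraction of time a unit is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def all_nodes_up_availability(node_availability, n_nodes):
    """Probability that all n nodes are up at once.

    Assumes nodes fail and recover independently of each other.
    """
    return node_availability ** n_nodes


if __name__ == "__main__":
    # Illustrative assumptions only, not measurements from any cluster.
    mtbf = 10_000.0     # hours between failures for one node
    repair = 4.0        # hours to diagnose and fix the hardware
    reboot = 0.1        # hours until the operating system is up and ready
    job_restart = 0.5   # hours until user code is actually running again
    n_nodes = 256

    # "Up" defined as: operating system up and ready.
    a_os = availability(mtbf, repair + reboot)
    # "Up" defined as: actually running user code.
    a_user = availability(mtbf, repair + reboot + job_restart)

    print(f"per node, OS up:           {a_os:.6f}")
    print(f"per node, user code up:    {a_user:.6f}")
    print(f"all {n_nodes} nodes, OS up:       "
          f"{all_nodes_up_availability(a_os, n_nodes):.4f}")
    print(f"all {n_nodes} nodes, user code:   "
          f"{all_nodes_up_availability(a_user, n_nodes):.4f}")
```

Even this toy version shows that the answer depends on which definition of 
"up" you pick and on whether you score a single node or the whole machine, 
which is exactly why a common terminology would help.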

RGB... maybe another chapter for your book?