[Beowulf] GPU diagnostics?
Donald Becker
becker at scyld.com
Mon Mar 30 14:10:35 PDT 2009
On Mon, 30 Mar 2009, David Mathog wrote:
> Joe Landman wrote:
> > Vendors have an nVidia supplied *GEMM based burn in test. Been thinking
> > about a set of diagnostics end users can run as a sanity check.
>
> My suspicion is that vendors run such burn in tests only for a very
> brief time. That time being "the minimum time required to find the
> percentage of failed units above which it would cost us more if they
> were found to be bad in the field" - and not a second longer.
I don't know about other vendors, but that's not Penguin's approach.
One reason is that we don't know the failure profile. But really it's a
trade-off between delivery expectations, likelihood of failures, and even
how much air conditioning capacity remains in the burn-in room.
We used to have a published policy of a minimum three-day
successful burn-in. If a part failed, or even if the machine rebooted, the
three-day clock started again.
The challenge with that policy is that it leads to unpredictable delivery,
which is distressing to someone who needs servers or workstations Right
Now.
Today the policy is much more flexible, driven in part by Penguin's
shift to building mostly clusters. Burn-in time is based on the
product, potentially modified by per-machine notes on the customer's
delivery requirements.
Cluster nodes have a preliminary stand-alone burn-in before being racked
into a cluster. Whole clusters then have a full burn-in, usually running
benchmarks and demo applications.
You might expect nearly zero errors when already-tested machines are
grouped in a cluster, but cluster applications can reveal errors that
typical burn-in tests don't trigger. And even a low percentage of
failures looks pretty bad when you have a few hundred machines in a
cluster.
> Finding
> marginal memory, certainly one of the easier tests, can easily take 24
> hours of testing.
And typically those memory modules test OK in a tester, even after being
pulled from a machine showing memory errors. (That's not surprising, since
most distributors test modules just before shipping them, and they are
tested again just before installation.)
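For reference, a minimal sketch of the sort of in-system pattern test that
eventually catches marginal memory. The buffer size, pass count, and dwell
time below are placeholder assumptions, not a tool we ship:

/* Fill a large buffer with a reproducible pseudo-random pattern, let it
 * sit resident for a while, then regenerate the sequence and compare.
 * Repeat for hours.  MEGS, PASSES, and the sleep interval are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

#define MEGS   1024UL   /* how much memory to exercise (assumed) */
#define PASSES 100      /* repeat long enough to matter (assumed) */

int main(void)
{
    size_t words = MEGS * 1024 * 1024 / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) { perror("malloc"); return 2; }

    unsigned long errors = 0;
    for (int pass = 0; pass < PASSES; pass++) {
        uint64_t x = 12345 + pass;

        /* Write a reproducible pattern. */
        for (size_t i = 0; i < words; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            buf[i] = x;
        }

        sleep(60);      /* leave the pattern resident a while */

        /* Regenerate the same sequence and compare. */
        x = 12345 + pass;
        for (size_t i = 0; i < words; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            if (buf[i] != x) {
                fprintf(stderr, "pass %d: mismatch at word %zu\n", pass, i);
                errors++;
            }
        }
    }
    printf("%lu total mismatches\n", errors);
    free(buf);
    return errors ? 1 : 0;
}

The point isn't the pattern; it's leaving the data resident and re-reading it
while the machine is otherwise loaded, which is what a quick pass in a module
tester never does.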
> Somehow I cannot imagine vendors spending quite that
> long burning in a graphics card. Well, maybe a top of the line pro
> card, but certainly not your run of the mill $39 budget card.
I'm guessing every vendor shipping big clusters or CUDA GPU systems does a
substantial burn-in, although it's likely rare that they use parallel
applications and check for successful runs.
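A minimal sketch of the sort of end-user GPU sanity check Joe mentioned,
written against a current CUDA toolkit rather than the vendor-supplied
burn-in. The matrix size, iteration count, and pass/fail criterion are
placeholder assumptions:

/* Run the same SGEMM repeatedly and flag any iteration whose output
 * differs from the first one.  Build with something like:
 *     nvcc -o gemmcheck gemmcheck.c -lcublas
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N     2048      /* matrix dimension (assumed) */
#define ITERS 1000      /* repeated multiplications (assumed) */

int main(void)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA  = (float *)malloc(bytes);
    float *hC0 = (float *)malloc(bytes);   /* reference result, first pass */
    float *hC  = (float *)malloc(bytes);
    for (size_t i = 0; i < (size_t)N * N; i++)
        hA[i] = (float)rand() / (float)RAND_MAX;

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    int bad = 0;
    for (int it = 0; it < ITERS; it++) {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
        cudaMemcpy(it == 0 ? hC0 : hC, dC, bytes, cudaMemcpyDeviceToHost);
        /* Same inputs, same routine: the result should repeat bit for bit. */
        if (it > 0 && memcmp(hC, hC0, bytes) != 0) {
            fprintf(stderr, "iteration %d differs from iteration 0\n", it);
            bad++;
        }
    }
    printf("%d of %d iterations mismatched\n", bad, ITERS);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hC0); free(hC);
    return bad ? 1 : 0;
}

With identical inputs a healthy card reproduces the result bit for bit, so
any drift over a long run points at the hardware rather than the application.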
It's the consumer-oriented, low-end production lines that can't fit a longer
burn-in into the process. A production line with pre-imaged OS
installations pretty much cannot do a full burn-in.
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA