[Beowulf] GPU diagnostics?
Donald Becker
becker at scyld.com
Mon Mar 30 14:10:35 PDT 2009
On Mon, 30 Mar 2009, David Mathog wrote:
> Joe Landman wrote:
> > Vendors have an nVidia supplied *GEMM based burn in test. Been thinking
> > about a set of diagnostics end users can run as a sanity check.
>
> My suspicion is that vendors run such burn in tests only for a very
> brief time. That time being "the minimum time required to find the
> percentage of failed units above which it would cost us more if they
> were found to be bad in the field" - and not a second longer.
I don't know about other vendors, but that's not Penguin's approach.
One reason is that we don't know the failure profile. But really it's a
trade-off between delivery expectations, likelihood of failures, and even
how much air conditioning capacity remains in the burn-in room.
We used to have a published policy of a minimum three-day
successful burn-in. If a part failed, or even if the machine rebooted, the
three-day clock started again.
The challenge with that policy is that it leads to unpredictable delivery,
which is distressing to someone who needs servers or workstations Right
Now.
Today the policy is much more flexible, driven in part by Penguin's
shift to building mostly clusters. Burn-in time is based on the
product, potentially modified by per-machine notes on the customer's
delivery requirements.
Cluster nodes have a preliminary stand-alone burn-in before being racked
into a cluster. Whole clusters then have a full burn-in, usually running
benchmarks and demo applications.
You might expect nearly zero errors when already-tested machines are
grouped in a cluster, but cluster applications can reveal errors that
typical burn-in tests don't trigger. And even a low percentage of
failures looks pretty bad when you have a few hundred machines in a
cluster.
> Finding
> marginal memory, certainly one of the easier tests, can easily take 24
> hours of testing.
And typically those memory modules test OK in a tester, even after being
pulled from a machine showing memory errors. (That's not surprising, since
most distributors test modules just before shipping them, and they are
tested again just before installation.)
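For reference, a minimal sketch of the sort of in-system pattern test that
eventually catches marginal memory. The buffer size, pass count, and dwell
time below are placeholder assumptions, not a tool we ship:

/* Fill a large buffer with a reproducible pseudo-random pattern, let it
 * sit resident for a while, then regenerate the sequence and compare.
 * Repeat for hours.  MEGS, PASSES, and the sleep interval are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

#define MEGS   1024UL   /* how much memory to exercise (assumed) */
#define PASSES 100      /* repeat long enough to matter (assumed) */

int main(void)
{
    size_t words = MEGS * 1024 * 1024 / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) { perror("malloc"); return 2; }

    unsigned long errors = 0;
    for (int pass = 0; pass < PASSES; pass++) {
        uint64_t x = 12345 + pass;

        /* Write a reproducible pattern. */
        for (size_t i = 0; i < words; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            buf[i] = x;
        }

        sleep(60);      /* leave the pattern resident a while */

        /* Regenerate the same sequence and compare. */
        x = 12345 + pass;
        for (size_t i = 0; i < words; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            if (buf[i] != x) {
                fprintf(stderr, "pass %d: mismatch at word %zu\n", pass, i);
                errors++;
            }
        }
    }
    printf("%lu total mismatches\n", errors);
    free(buf);
    return errors ? 1 : 0;
}

The point isn't the pattern; it's leaving the data resident and re-reading it
while the machine is otherwise loaded, which is what a quick pass in a module
tester never does.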
> Somehow I cannot imagine vendors spending quite that
> long burning in a graphics card. Well, maybe a top of the line pro
> card, but certainly not your run of the mill $39 budget card.
I'm guessing every vendor shipping big clusters or CUDA GPU systems does a
substantial burn-in, although it's likely rare that they use parallel
applications and check for successful runs.
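A minimal sketch of the sort of end-user GPU sanity check Joe mentioned,
written against a current CUDA toolkit rather than the vendor-supplied
burn-in. The matrix size, iteration count, and pass/fail criterion are
placeholder assumptions:

/* Run the same SGEMM repeatedly and flag any iteration whose output
 * differs from the first one.  Build with something like:
 *     nvcc -o gemmcheck gemmcheck.c -lcublas
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N     2048      /* matrix dimension (assumed) */
#define ITERS 1000      /* repeated multiplications (assumed) */

int main(void)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA  = (float *)malloc(bytes);
    float *hC0 = (float *)malloc(bytes);   /* reference result, first pass */
    float *hC  = (float *)malloc(bytes);
    for (size_t i = 0; i < (size_t)N * N; i++)
        hA[i] = (float)rand() / (float)RAND_MAX;

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    int bad = 0;
    for (int it = 0; it < ITERS; it++) {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
        cudaMemcpy(it == 0 ? hC0 : hC, dC, bytes, cudaMemcpyDeviceToHost);
        /* Same inputs, same routine: the result should repeat bit for bit. */
        if (it > 0 && memcmp(hC, hC0, bytes) != 0) {
            fprintf(stderr, "iteration %d differs from iteration 0\n", it);
            bad++;
        }
    }
    printf("%d of %d iterations mismatched\n", bad, ITERS);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hC0); free(hC);
    return bad ? 1 : 0;
}

With identical inputs a healthy card reproduces the result bit for bit, so
any drift over a long run points at the hardware rather than the application.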
It's the consumer-oriented, low-end production lines that can't fit a longer
burn-in into the process. A production line with pre-imaged OS
installations pretty much cannot do a full burn-in.
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA