[Beowulf] GPU diagnostics?

Joe Landman landman at scalableinformatics.com
Mon Mar 30 15:31:17 PDT 2009

David Mathog wrote:
> Donald Becker wrote:
>> On Mon, 30 Mar 2009, David Mathog wrote:
>>> Joe Landman wrote:
>>>> Vendors have an nVidia supplied *GEMM based burn in test.  Been thinking
>>>> about a set of diagnostics end users can run as a sanity check.
>>> My suspicion is that vendors run such burn in tests only for a very
>>> brief time.  That time being "the minimum time required to find the
>>> percentage of failed units above which it would cost us more if they
>>> were found to be bad in the field" - and not a second longer.
>> I don't know about other vendors, but that's not Penguin's approach.
> By "vendor" I meant graphics card vendors, not cluster or HPC vendors. 
> My interest in this sort of diagnostic arose in relation to an
> inexpensive graphics card bought at Newegg.  I was asking here
> specifically because it seemed likely that HPC vendors _would_ have
> the sort of GPU diagnostic I was seeking, and might be willing to share
> it.  (As opposed to the tool Joe referred to, which seems not to be
> generally available.)

FWIW, we agree with (and implement something similar to) Don's burn in 
procedure, and yes, it sometimes annoys customers who want it *now*. 
But it also (massively) reduces infant mortality rates (and we have 
even designed new disk packaging to reduce the impact of the sometimes 
fatal disk malady named UPS/Fedex-osis).

This said, there really isn't a memory checker for GPUs just yet.  Could 
be done, and probably should be ...
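
As a rough illustration of what such a memory checker might do, here is a minimal host-side sketch in Python. It is not an existing tool: `pattern_test` and `StuckBitMemory` are hypothetical names, and an ordinary byte buffer stands in for GPU memory, which a real checker would have to allocate and access on the device itself.

```python
def pattern_test(buf, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write fixed patterns, read them back, and return a list of
    (offset, expected, got) tuples for every mismatch."""
    errors = []
    n = len(buf)
    # Fixed-pattern passes: all zeros, all ones, and alternating bits
    # expose bits that are stuck at 0 or stuck at 1.
    for p in patterns:
        for i in range(n):
            buf[i] = p
        for i in range(n):
            if buf[i] != p:
                errors.append((i, p, buf[i]))
    # Address pass: each byte holds the low 8 bits of its own offset,
    # so a write that lands in the wrong cell shows up on read-back.
    # (A real test would use wider words than one byte.)
    for i in range(n):
        buf[i] = i & 0xFF
    for i in range(n):
        if buf[i] != (i & 0xFF):
            errors.append((i, i & 0xFF, buf[i]))
    return errors


class StuckBitMemory:
    """Hypothetical faulty memory for demonstration: bit 0 of one
    byte always reads back as 1."""

    def __init__(self, size, bad_offset):
        self.cells = bytearray(size)
        self.bad_offset = bad_offset

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, i, value):
        self.cells[i] = value

    def __getitem__(self, i):
        value = self.cells[i]
        return value | 0x01 if i == self.bad_offset else value


good = pattern_test(bytearray(64))
bad = pattern_test(StuckBitMemory(64, bad_offset=3))
print(len(good), "errors on healthy buffer;", len(bad), "on faulty one")
```

The simulated stuck bit is caught only by the patterns whose low bit is 0, which is why several complementary patterns are used rather than one.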

Also, likely we should have a long term crunching diagnostic, where we 
already know the answer to a computational problem, and simply have it 
burn cycles.
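
The known-answer idea can be sketched as follows. This is only an illustration under stated assumptions: the workload is a small integer matrix multiply in pure Python standing in for a GPU *GEMM call, and `make_matrix`, `matmul`, `digest`, and `burn` are hypothetical names. Integer inputs keep the result exact; a floating-point GPU version would need either bitwise comparison on identical hardware or a tolerance.

```python
import hashlib


def make_matrix(n, seed):
    """Deterministic n x n integer matrix from a simple LCG, so every
    run starts from exactly the same inputs."""
    x = seed
    rows = []
    for _ in range(n):
        row = []
        for _ in range(n):
            x = (1103515245 * x + 12345) % 2**31
            row.append(x % 100)
        rows.append(row)
    return rows


def matmul(a, b):
    """Reference workload: plain triple-loop matrix product."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def digest(matrix):
    """Collapse a result matrix to a short fingerprint for comparison."""
    return hashlib.sha256(repr(matrix).encode()).hexdigest()


def burn(compute, a, b, expected, iterations):
    """Run compute(a, b) repeatedly and return the indices of the
    iterations whose result did not match the known answer."""
    return [i for i in range(iterations) if digest(compute(a, b)) != expected]


a = make_matrix(8, seed=1)
b = make_matrix(8, seed=2)
expected = digest(matmul(a, b))   # computed once, on trusted hardware

failures = burn(matmul, a, b, expected, iterations=100)
print("failed iterations:", failures)
```

In a real diagnostic `compute` would be the device-side routine under test and the loop would run for hours; any non-empty failure list means the hardware produced a wrong answer at least once.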

But GPUs are more complex than this: we need to worry about PCIe bus 
transfers, several different flavors of memory, etc.

Really, since there is very little you can do if a GPU card is toast, 
other than replace it, it might be better to have the test done at this 

