[Beowulf] GPU diagnostics?

Joe Landman landman at scalableinformatics.com
Mon Mar 30 10:10:17 PDT 2009

David Mathog wrote:
> Have any of you CUDA folks produced diagnostic programs you run during
> "burn in" of new GPU based systems, in order to weed out problem units
> before putting them into service?  Minimally, something resembling
> memtest86, to be used to find buggy memory associated with the GPU?
> Optimally, it would also more directly exercise the GPU's capabilities.
> I asked on the NV linux forum if there were any official Nvidia graphics
> card diagnostic programs, and nobody there answered with one.  This was
> originally with respect to some VDPAU issues, where it looked at first
> like there might be a hardware problem on a small set of systems,
> including mine, although in the end it turned out to be an uninitialized
> variable (it was not my code).  There was no objective way to
> demonstrate for VDPAU based software that "this graphics card is
> functioning normally" to help sort this out.  I figured the CUDA folks
> should have something like this, else how could you trust the results
> from the GPU calculations?

Vendors have an nVidia-supplied *GEMM-based burn-in test.  I have been
thinking about a set of diagnostics that end users can run as a sanity
check.
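
As one illustration of what an end-user sanity check along the lines of
the memtest86 idea might look like, here is a minimal CUDA sketch.  It is
not the vendor *GEMM test mentioned above, and it is not an official
diagnostic.  The kernel names (write_pattern, verify_pattern), the 256 MiB
buffer size, the launch geometry, and the four bit patterns are all
assumptions chosen for illustration.  It writes a pattern mixed with the
word index into device memory, reads it back on the GPU, and counts
mismatches.

```cuda
// Minimal GPU memory pattern test: a sketch, not a vendor diagnostic.
// Build (assuming the CUDA toolkit is installed): nvcc -o gpumemcheck gpumemcheck.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Fill every word with the pattern XORed with its index (grid-stride loop).
__global__ void write_pattern(unsigned int *buf, size_t n, unsigned int pattern)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        buf[i] = pattern ^ (unsigned int)i;
}

// Re-read every word and count those that differ from the expected value.
__global__ void verify_pattern(const unsigned int *buf, size_t n,
                               unsigned int pattern, unsigned long long *errors)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        if (buf[i] != (pattern ^ (unsigned int)i))
            atomicAdd(errors, 1ULL);
}

int main()
{
    // Assumed test size: 64 Mi words = 256 MiB.  Adjust to the card under test.
    const size_t n = (size_t)64 << 20;
    // Assumed patterns: all zeros, all ones, and two alternating-bit words.
    const unsigned int patterns[] = {0x00000000u, 0xFFFFFFFFu,
                                     0xAAAAAAAAu, 0x55555555u};

    unsigned int *buf = nullptr;
    unsigned long long *d_errors = nullptr;
    CHECK(cudaMalloc(&buf, n * sizeof(unsigned int)));
    CHECK(cudaMalloc(&d_errors, sizeof(unsigned long long)));

    unsigned long long total = 0;
    for (unsigned int p : patterns) {
        CHECK(cudaMemset(d_errors, 0, sizeof(unsigned long long)));
        write_pattern<<<256, 256>>>(buf, n, p);
        verify_pattern<<<256, 256>>>(buf, n, p, d_errors);
        CHECK(cudaGetLastError());  // catches launch failures

        // cudaMemcpy waits for both kernels and reports execution errors.
        unsigned long long mismatches = 0;
        CHECK(cudaMemcpy(&mismatches, d_errors, sizeof(mismatches),
                         cudaMemcpyDeviceToHost));
        std::printf("pattern 0x%08X: %llu mismatches\n", p, mismatches);
        total += mismatches;
    }

    CHECK(cudaFree(buf));
    CHECK(cudaFree(d_errors));
    return total == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

A zero exit status only says that this one buffer held these patterns on
this run.  The sketch tests a single allocation rather than all of device
memory, some reads may be served from on-chip caches rather than DRAM on
hardware that has them, and it does nothing to load the arithmetic units
the way a GEMM-based burn-in would.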

