[Beowulf] GPU diagnostics?

Lux, James P james.p.lux at jpl.nasa.gov
Mon Mar 30 16:38:30 PDT 2009

> > Finding marginal memory, certainly one of the easier tests, can
> > easily take 24 hours of testing.
> And typically those memory modules test OK in a tester, even 
> after being pulled from a machine showing memory errors.  
> (That's not surprising, since most distributors test modules 
> just before shipping them, and they are tested again just 
> before installation.)

I suspect that the problem is not a "memory" problem per se, but some other aspect, maybe a marginal timing thing on the bus.  A lot of "memory tester" boxes basically just test that the memory is functional (i.e. you can read and write all locations at the rated speed).
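
To make "functional" concrete, here is a minimal sketch of that kind of test: write a pattern to every location and read it back. The names (functional_test, StuckBitMemory) and the stuck-bit fault model are made up for illustration, not taken from any real tester. Note that a check like this says nothing about how close the part is to its timing or voltage limits.

```python
def functional_test(mem, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each pattern to every location, read it back,
    and return a list of (address, pattern) pairs that failed."""
    failures = []
    for pattern in patterns:
        for addr in range(len(mem)):
            mem[addr] = pattern
        for addr in range(len(mem)):
            if mem[addr] != pattern:
                failures.append((addr, pattern))
    return failures


class StuckBitMemory:
    """Hypothetical fault model: bit 0 of one address always reads as 1."""

    def __init__(self, size, bad_addr):
        self.data = bytearray(size)
        self.bad_addr = bad_addr

    def __len__(self):
        return len(self.data)

    def __setitem__(self, addr, value):
        self.data[addr] = value

    def __getitem__(self, addr):
        value = self.data[addr]
        return value | 0x01 if addr == self.bad_addr else value


if __name__ == "__main__":
    print(functional_test(bytearray(1024)))        # healthy buffer
    print(functional_test(StuckBitMemory(16, 5)))  # hard fault at address 5
```

A hard fault like the stuck bit shows up right away. A part that only misbehaves near a timing limit would pass this every time.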

Looking at products from http://www.memorytest.com/ (which happened to be the first Google hit), it looks like they do a basic functional test but, in their normal stock configuration, don't exercise the parts at the timing margins (i.e. drive them with setup and hold times at minimums, or perhaps the worst case transition timing).  Nor does a simple tester really test whether the logic level voltage tolerances are what they should be (i.e. is the "eye" as open as it should be).

The tester here http://www.microtestsystem.com/rs800-166.html seems to be able to step the timing only in whole multiples of the basic clock period (e.g. 2, 3, or 4 clocks for Trd), so it can't check whether, maybe, the part stops working at 1.9 clocks.  But hey, it DOES have a "heavy duty test start" button, which would be important!
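
A toy model of why whole-clock steps hide a marginal part. Everything here is an assumption for illustration: the part is reduced to a single hypothetical threshold (the minimum delay at which it still works, in tenths of a clock), and the function names are invented. Real parts and real testers are far messier.

```python
def part_works(delay_tenths, true_min_tenths):
    """Hypothetical part model: works if the delay meets its true minimum.
    Delays are integers in tenths of a clock to avoid float rounding."""
    return delay_tenths >= true_min_tenths


def integer_step_test(true_min_tenths, settings_clocks=(2, 3, 4)):
    """Test only at whole-clock settings, as a simple tester might."""
    return {n: part_works(n * 10, true_min_tenths) for n in settings_clocks}


def margin_tenths(setting_clocks, true_min_tenths):
    """Shrink the delay in 0.1-clock steps from the given setting until the
    part is about to fail. Returns the margin in tenths of a clock, or None
    if the part fails at the setting itself."""
    delay = setting_clocks * 10
    if not part_works(delay, true_min_tenths):
        return None
    while delay > 0 and part_works(delay - 1, true_min_tenths):
        delay -= 1
    return setting_clocks * 10 - delay


if __name__ == "__main__":
    for name, true_min in (("healthy", 12), ("marginal", 19)):
        print(name, integer_step_test(true_min), margin_tenths(2, true_min))
```

Both modeled parts pass at 2, 3, and 4 clocks, so the whole-clock test can't tell them apart. Only the finer sweep shows that one has 0.8 clocks of margin at the 2-clock setting and the other has 0.1.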

We won't even get into the possibility of latent ESD damage from handling.
