[Beowulf] Strange hardware? problems

David Mathog mathog at caltech.edu
Tue May 1 12:27:46 PDT 2007


"Robert G. Brown" <rgb at phy.duke.edu> wrote:
> I've been coding, one way or another, for coming up on 35 years or
> thereabouts, starting with paper tape, going through cards (lots of
> cards), and up the evolutionary ladder.  In all of that time, I've
> encountered one -- count it, one -- time that a consistent error in code
> I was running was due to a real failure in the hardware I was running on
> and not a bug in my own code.

RGB has an extra 5 years on me, but my experience has been similar:
only very, very, very rarely is a program fault the result of a true
hardware issue.  (This excludes anything that runs from one box to
another over a cable or fiber, where hardware issues are more common.)
We once traced a bug in an FFT subroutine running on an array
processor to faulty memory, and from there right down to a failure
pattern suggesting that two address pins were shorted together.  On
opening the beast up, sure enough, the short was right where it had to
be, and it was repaired with a scalpel.  This was around 1982.

Anyway, one caveat.  With the proliferation of x86 variants I now
on occasion hit a binary that was compiled for some other processor
variant, and it blows up when it tries to use an instruction that is
not supported on the processor it is actually running on.  As I
mentioned previously, valgrind can catch these for you.  Alternatively,
recompile using switches you know are supported on the target processor.
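
One rough way to see which instruction set extensions a node supports
before running or recompiling is to look at the feature flags the kernel
reports.  The following is only a minimal sketch, assuming a Linux x86
machine whose /proc/cpuinfo contains a "flags" line; the function names
and the sample text are hypothetical, and the flag spellings (sse2, avx2)
are the ones Linux uses.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line as a set.

    Assumes the Linux x86 /proc/cpuinfo layout: 'flags : fpu vme ...'.
    Returns an empty set if no such line is present.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text):
    """Return, sorted, the required flags that the CPU does not advertise."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Hypothetical cpuinfo excerpt for an older processor without AVX2.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_2 avx\n"

# A binary built to use AVX2 would be expected to fault on this CPU.
print(missing_features(["sse2", "avx2"], SAMPLE))
```

On a real node the text would come from reading /proc/cpuinfo instead of
SAMPLE.  A non-empty result means a binary that depends on those
extensions is a candidate for the illegal-instruction failure described
above.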

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


