[Beowulf] Strange hardware? problems

Robert G. Brown rgb at phy.duke.edu
Mon Apr 30 15:57:36 PDT 2007

On Mon, 30 Apr 2007, Orion Poplawski wrote:

> I do mean accuracy, and not necessarily subtle - things blow up bad. Perform 
> some set of calculations over and over and error if it doesn't give the 
> expected result.

Again, this sounds like a bug, not a hardware problem.  If it occurs on
two completely different systems, I'd suspect a plain old programming or
systems bug until proven otherwise.  There are so many possible bugs
that it is difficult to know where to start.  Program Bessel functions
wrong (recursion in the wrong direction) and you end up with garbage.
Program incomplete gamma functions wrong, ditto.  Use single precision
where you should be using double, do a sum in the wrong order, overwrite
the boundary of an array...
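
To make the Bessel remark concrete, here is a minimal sketch (my own
illustration, not code from the program under discussion) of the
three-term recurrence J_{n+1}(x) = (2n/x) J_n(x) - J_{n-1}(x) run in
both directions for x = 1.  The function names, the starting index of
40, and the power-series reference are all assumptions chosen for the
example.

```python
import math

def bessel_j_series(n, x, terms=30):
    """Reference value of J_n(x) from its power series (fine for small x)."""
    return sum(
        (-1) ** k / (math.factorial(k) * math.factorial(n + k))
        * (x / 2) ** (2 * k + n)
        for k in range(terms)
    )

def bessel_j_upward(n, x):
    """Recur upward from J_0 and J_1 (n >= 1). Unstable once n exceeds x."""
    prev, curr = bessel_j_series(0, x), bessel_j_series(1, x)
    for k in range(1, n):
        prev, curr = curr, (2 * k / x) * curr - prev
    return curr

def bessel_j_downward(n, x, start=40):
    """Recur downward from an arbitrary seed at order `start` (Miller's
    method), then normalize with J_0 + 2*(J_2 + J_4 + ...) = 1."""
    f = [0.0] * (start + 2)
    f[start] = 1.0  # arbitrary seed; the normalization removes the scale
    for k in range(start, 0, -1):
        f[k - 1] = (2 * k / x) * f[k] - f[k + 1]
    norm = f[0] + 2 * sum(f[2::2])
    return f[n] / norm

if __name__ == "__main__":
    exact = bessel_j_series(20, 1.0)
    for name, fn in (("upward", bessel_j_upward), ("downward", bessel_j_downward)):
        value = fn(20, 1.0)
        print(f"{name:9s} {value: .6e}  rel.err {abs(value - exact) / exact:.2e}")
```

The upward pass amplifies the rounding error in J_0 and J_1 at every
step, so by n = 20 nothing of the true value (which is tiny) survives.
The downward pass damps the same error instead.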

The fact that you got/get a segmentation violation in one of your runs
(IIRC) is pretty much telling you "hey, you're writing out of bounds in
memory somewhere".  This can be a simple case of somebody not doing
bounds checking in a program and your using the program with inputs they
didn't expect, or exercising some rarely used code path.
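
As a hedged illustration of what missing bounds checking looks like,
here is a small histogram fill (hypothetical names throughout, nothing
from the actual program).  In C the bad index would scribble on
neighboring memory; in Python a negative index quietly wraps around to
the end of the list, which is the same kind of silent corruption.

```python
import math

def fill_unchecked(samples, lo, width, nbins):
    """Histogram fill that trusts its inputs."""
    counts = [0] * nbins
    for x in samples:
        idx = math.floor((x - lo) / width)
        counts[idx] += 1  # idx = -1 silently lands in the LAST bin
    return counts

def fill_checked(samples, lo, width, nbins):
    """Same fill, but refuses any sample that maps outside the array."""
    counts = [0] * nbins
    for i, x in enumerate(samples):
        idx = math.floor((x - lo) / width)
        if not 0 <= idx < nbins:
            raise ValueError(f"sample {i} = {x!r} maps to bin {idx}, "
                             f"valid bins are 0..{nbins - 1}")
        counts[idx] += 1
    return counts

if __name__ == "__main__":
    data = [0.5, -0.5]  # the second sample lies below the histogram range
    print(fill_unchecked(data, 0.0, 1.0, 4))  # wrong answer, no complaint
    try:
        fill_checked(data, 0.0, 1.0, 4)
    except ValueError as err:
        print("caught:", err)
```

The checked version fails loudly at the first bad input and says which
one it was, which is far easier to chase than a wrong answer many steps
later.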

The reason to use open source code for doing serious work like this is
that you can debug it.  Everybody has their own methodology, but as I
think Greg mentioned recently in a different context, ultimately it
involves instrumenting your code with output statements and tracing it
through one or more crashes.  How difficult this ends up being depends
on who wrote the code, how well commented it is, and whether or not you
can make a clever guess as to where the problem "probably" lies.
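
A minimal sketch of that kind of instrumentation, assuming a toy
problem of my own choosing (explicit Euler steps for dy/dt = -rate*y,
which should only ever decay): log every step, and stop at the first
step where the value does something it cannot legitimately do.  The
function name and the invariant are illustrative assumptions.

```python
import logging
import math

log = logging.getLogger("trace")

def euler_decay(y0, rate, dt, steps):
    """Explicit Euler for dy/dt = -rate*y, instrumented to catch blow-up."""
    y = y0
    for i in range(steps):
        y_new = y + dt * (-rate * y)
        # Output statement: a full trace of state, visible at DEBUG level.
        log.debug("step=%d y=%r y_new=%r", i, y, y_new)
        # Invariant: a decaying solution never exceeds its starting size.
        if not math.isfinite(y_new) or abs(y_new) > abs(y0):
            raise FloatingPointError(
                f"step {i}: y went from {y!r} to {y_new!r} "
                f"(dt*rate = {dt * rate!r})"
            )
        y = y_new
    return y

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    print(euler_decay(1.0, 1.0, 0.1, 10))  # small step: decays quietly
    try:
        euler_decay(1.0, 1.0, 3.0, 10)     # step too large: blows up
    except FloatingPointError as err:
        print("caught:", err)
```

The point is that the failure is reported where it first happens, with
the values that caused it, rather than as garbage at the end of the run.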

I've been coding, one way or another, for coming up on 35 years or
thereabouts, starting with paper tape, going through cards (lots of
cards), and up the evolutionary ladder.  In all of that time, I've
encountered one -- count it, one -- time that a consistent error in code
I was running was due to a real failure in the hardware I was running on
and not a bug in my own code.  And that was on crap hardware.  I cannot
begin to count the number of times that I've discovered bugs in my own
code, including ones that were so subtle that I SWORE that the computer
was making a mistake -- until I discovered my own.

The question you should be asking, then, isn't "what could make my
program fail".  Nobody can answer that from far away.  There is a near
infinity of possible causes.  The right question is "how can I FIGURE
OUT why my program is failing?"  And the answer is, slowly and
deliberately, by presuming a bug in the code and instrumenting the code
until you can "see" the failure occurring and completely understand
where and why it fails.  Nothing else works.  Seriously.



Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
