[Beowulf] Strange hardware? problems

Tue May 1 16:28:08 PDT 2007

On Tue, 1 May 2007, David Mathog wrote:

> Anyway, one caveat.  With the proliferation of x86 variants I now
> on occasion hit a binary which has been compiled for some other
> processor variant that blows up when it tries to use an instruction
> which is not supported on the processor it is actually running on.  As I
> mentioned previously, valgrind can catch these for you.  Or recompile
> using switches you know are supported on the target processor.

Completely agreed.  In fact in my case the problem was a mix of compiler
and the fact that I was using an Intel CPU which had an obscure
multiplication bug that was eventually worked around in the compiler.
The point is that EVEN if the problem turns out to be hardware or
compiler or something nominally "beyond your control", the solution is
always the same.  Instrument the hell out of your code, run it to
failure, accumulate data on the failure thereby, reinstrument to catch
the failure more tightly, iterate until you know that this run failed
between these two instructions and that the values of all possible loop
indices and variables at that time were the following vector of numbers
and that they all are/are not what they should be and if not they began
to diverge here, and here are the values of everything around THAT
point.  Then you may have to literally single step through the logic to
see where either you asked the machine to multiply six (and the variable
indeed contained six) times seven (and the variable indeed contained
seven) and the stupid computer returned forty-ONE.  Or where it returned
42 but three lines further on your index went out of bounds on the
dynamic array pointer and you overwrote 42 with 0xA321FD07 (random
garbage).
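
To make that concrete, here is a minimal sketch of the kind of
instrumentation I mean, in Python purely for brevity.  The helper names
(checked_mul, checked_store) and the "step" counter are my own
inventions for illustration, not anything from a real library, and the
repeated-addition cross-check only makes sense for small non-negative
integers.  The idea is just to log the state at every step and to stop
at the first point where a value or an index is not what it should be.

```python
import logging

log = logging.getLogger("instrument")


def checked_mul(a, b, step):
    """Multiply a by b, log the operands, and cross-check the product."""
    result = a * b
    # Independent check by repeated addition; valid only for small
    # non-negative integer b (an assumption of this sketch).
    expected = sum(a for _ in range(b))
    log.debug("step=%d a=%r b=%r result=%r", step, a, b, result)
    if result != expected:
        raise AssertionError(
            f"step {step}: {a} * {b} gave {result}, expected {expected}"
        )
    return result


def checked_store(buf, index, value, step):
    """Store value at buf[index], refusing any index outside the buffer."""
    log.debug("step=%d index=%r value=%r len=%d", step, index, value, len(buf))
    # Python would silently wrap a negative index, so check both ends.
    if not 0 <= index < len(buf):
        raise IndexError(
            f"step {step}: index {index} outside 0..{len(buf) - 1}"
        )
    buf[index] = value


results = [0] * 4
checked_store(results, 0, checked_mul(6, 7, step=0), step=1)
try:
    checked_store(results, 4, 0xA321FD07, step=2)  # one past the end
except IndexError as err:
    print("caught:", err)
```

Turn the logger up to DEBUG and you get the running vector of values;
the exceptions tell you between which two steps things went wrong.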

The worst bug I can recall ever having to squash in my own code was back
in my fortran days, where I had strong type checking and everything.  On
a single line in a program of several thousand lines I had typed an N
instead of an M in a program that used both (in fact N was the absolute
value of M).  Where I did this, both made sense, but N was not in fact
initialized and hence had a fixed value of zero.  Zero was a possible
value -- these were angular momentum indices -- and the function values
returned were not only plausible they were correct for certain values of
the input parameters to the overall program.  They were just wrong,
usually by a fixed factor, for others (and fortunately I had a pretty
strong idea of what right was).
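
A toy version of that failure, again as a Python sketch.  The formula
2|m| + 1 is a made-up stand-in, not the actual angular momentum
function, and "n = 0" plays the role of the never-initialized Fortran N
that happened to hold zero.  (In this toy the error is not a fixed
factor; it only shows the "right for some inputs, wrong for others"
behavior.)  The useful part is the sweep: if you know what right is,
compare against it over the whole input range rather than at one point.

```python
def reference(m):
    """Hypothetical known-good value; depends on |m|."""
    return 2 * abs(m) + 1


def buggy(m):
    """Same formula, but with the N-for-M style typo."""
    n = 0  # stand-in for the uninitialized N, which happened to be zero
    return 2 * n + 1  # should have used abs(m)


def sweep(candidate, known_good, inputs):
    """Return the inputs for which candidate disagrees with known_good."""
    return [m for m in inputs if candidate(m) != known_good(m)]


# Correct at m = 0, wrong everywhere else in the range.
print(sweep(buggy, reference, range(-2, 3)))  # [-2, -1, 1, 2]
```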

EVEN instrumenting the code to where I literally single stepped through
the fortran -- and this was code on cards, so there was nothing like an
interactive debugger, mind -- I swear I stared at the code for close to
a week before the N/M typo finally jumped out of all those lines and
smacked me right between the eyes.

The point being: don't expect the process to be easy.  At least you have
the advantage of having a high probability of failure.  The worst that
can happen to you is that as you instrument the code with output lines
the problem will disappear.  This, too, has happened to me on more than
one occasion -- changing the alignment of the code even in small ways
sometimes causes a failure to be missed if the problem is with pointers
and memory or a failure like the one David describes.  I actually
debugged a "heisenbug" like this at a different level in scripted code
on Friday.

A long and complex program was being started up on a dedicated server as
the last step in the boot process.  The program itself wrote extensively
to output during startup and was initiated via a fairly standard init.d
script that backgrounded the startup call.  The system was booting fine,
you could see the application startup occurring "successfully", but
post-boot it wasn't running!  Consistently.  If you started it by hand,
it came up perfectly.  When I started to instrument the startup script
to "watch" it start up, it suddenly started to come up during the boot.

It turned out to be a race condition between the program and the
completion of the rc boot process by init.  The program took five
seconds to start and wrote to stdout the whole way in the background.
When the rc script finished in the foreground, it went away, taking the
backgrounded program's tty with it, so the program crashed without a trace.
Running it by hand from a tty obviously worked fine as long as you
watched.  Running it by script in the boot worked fine as long as you
watched.  To get it to work WITHOUT watching, one had to either add a 6
second sleep to the startup script to wait for it to finish writing to
stdout (and hence get a /var/log/messages trace) or redirect stdout and
stderr to /dev/null (and lose it).  Or rewrite the code itself that was
being backgrounded to log directly and not write to stdout except in
debug mode... but it wasn't my code.  Or maybe even move the code up to
where at least five seconds worth of other startup remained before rc
boot completed.
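
For what it is worth, here is a sketch of a variant of the "redirect"
fix that keeps the output instead of throwing it away, written as a
small Python launcher rather than the actual init.d shell script (which
I am not reproducing here).  The function name start_detached and the
log path in the usage note are hypothetical.  It assumes a POSIX system:
start_new_session makes the child call setsid(), so it no longer depends
on the launching script's tty, and stdout/stderr go to a log file.

```python
import subprocess
import sys


def start_detached(argv, log_path):
    """Start argv in the background, independent of the caller's tty.

    stdout and stderr are appended to log_path, so the startup chatter
    is kept rather than lost when the launching script exits.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,   # never read from the boot console
            stdout=log,                 # startup output goes to the log
            stderr=subprocess.STDOUT,   # errors go to the same place
            start_new_session=True,     # POSIX: setsid() in the child
        )
    finally:
        # The child holds its own copy of the descriptor.
        log.close()
    return proc
```

Something like start_detached(["/path/to/program"],
"/var/log/program-startup.log") would then be the last thing the
startup script does, with both paths replaced by the real ones.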

Bugs can be SUBTLE.  Bugs can be heisenbugs like this that are other
people's "fault" (but your problem).  Be patient, be systematic, be
meditative and await Enlightenment.

      rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu