[Beowulf] Re: debugging

Wed Apr 11 09:48:22 PDT 2007

 Matt Funk <mafunk at nmsu.edu> wrote:

> The reason i want to run on 32 processor though, is that it takes (on 32
> procs) several hours till my program crashes. Also, i would like to be able
> to keep the conditions under which it crashes intact as much as possible
> (i.e. run on 32 procs rather than 1).

> 
> Does anyone have any advice?

This is pretty general...

My advice is to be sure the code is absolutely as clean and
standards-compliant as possible before you touch a debugger.  That means
adding -Wall -pedantic -std=c99 (for gcc, or the equivalent for your
compiler) and not stopping until every bit of it compiles without a
single warning.  Then run it through valgrind or the equivalent
and fix every memory problem it finds.  Then, and only then, try your
long run again.  If you're lucky this will fix the problem and you
won't have to debug anything.

Also a comment: if your program crashes, then pretty much by definition
it is not doing sufficient error checking.  Rather than "kaboom!" a
well-written program will emit a "could not allocate memory" or "invalid
pointer" message and then exit gracefully.  Yes, I know that level of
error checking is often left out of inner loops for speed reasons.
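
To make that concrete, here is a minimal sketch of a checked allocation
in C99.  The names xmalloc_at and XMALLOC are made up for illustration,
they are not from your code or from any library, and exiting on failure
is only one possible policy.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n bytes, or report where the request
   was made and exit instead of returning NULL to the caller. */
static void *xmalloc_at(size_t n, const char *file, int line)
{
    void *p = malloc(n);
    if (p == NULL && n != 0) {
        fprintf(stderr, "%s:%d: could not allocate %zu bytes\n",
                file, line, n);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Records the call site automatically. */
#define XMALLOC(n) xmalloc_at((n), __FILE__, __LINE__)
```

A call such as "double *v = XMALLOC(count * sizeof *v);" then either
succeeds or stops with a message naming the file and line, which is a
lot easier to chase than a crash somewhere downstream.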

Assuming that your code has a fairly fast cycle, so that several hours
represents many, many cycles, you're almost certainly looking for
an invalid memory access, a memory leak, or some loop counter running
past the end of the loop (for instance, via an unhandled condition).
Valgrind can help you find some of these.  If the program does any
file I/O you might also be using up all the file descriptors.
(Saw that once in a version of NCBI BLAST, where it kept opening a gi
file and not closing it before the next open.)
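
The file descriptor case is easy to illustrate.  The sketch below uses
a made-up function, count_lines, which is not from BLAST or from your
program; the point is only that every successful fopen is paired with
an fclose on every path out of the function.

```c
#include <stdio.h>

/* Hypothetical helper: count the newlines in a file.
   Returns -1 if the file cannot be opened, read, or closed. */
static long count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);
        return -1;
    }

    long lines = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }

    int failed = ferror(fp);
    /* Leaving this fclose out leaks one descriptor per call. */
    if (fclose(fp) != 0)
        failed = 1;
    return failed ? -1 : lines;
}
```

Without the fclose, a loop that calls this once per cycle runs fine
until the per-process descriptor limit is reached, and then every fopen
starts returning NULL.  That would fit a failure that only shows up
after hours.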

If all of that fails, and you have easy access to another cluster with a
completely different architecture, try building and running there. 
Often a subtle problem on one CPU type stands out like a sore thumb
on another.

If all that fails, then Joe is probably right: start with the dumps and
work backwards to at least find out where in the code the crash is
taking place.  Or run each instance under strace, but be sure to log
the output for each compute node to a local file on that node.  Then
you can put print statements in the relevant locations and run again.
Just don't be surprised if, when the code is optimized, those print
statements themselves "cure" the problem.
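
For the print statements, something along these lines may be enough.
It is a sketch only: trace_open, trace and TRACE are names invented
here, and the log path is whatever local file you choose on each node.
Each message is flushed right away so that it should already be in the
file if the process dies on the next line.

```c
#include <stdarg.h>
#include <stdio.h>

static FILE *trace_fp;   /* per-process log, NULL until opened */

/* Hypothetical helper: open (append to) a log file local to this node.
   Returns 0 on success, -1 on failure. */
static int trace_open(const char *path)
{
    trace_fp = fopen(path, "a");
    return trace_fp != NULL ? 0 : -1;
}

/* Write one "file:line: message" record and flush it immediately. */
static void trace(const char *file, int line, const char *fmt, ...)
{
    va_list ap;

    if (trace_fp == NULL)
        return;
    fprintf(trace_fp, "%s:%d: ", file, line);
    va_start(ap, fmt);
    vfprintf(trace_fp, fmt, ap);
    va_end(ap);
    fputc('\n', trace_fp);
    fflush(trace_fp);
}

/* C99 variadic macro that records the call site. */
#define TRACE(...) trace(__FILE__, __LINE__, __VA_ARGS__)
```

Call trace_open once at startup with a path on local disk, then
sprinkle calls like TRACE("entering step %d", step) around the region
the dumps point to.  The flush on every message slows things down, so
keep the calls out of the innermost loops if you can.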

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech