[Beowulf] debugging

Thu Apr 12 06:57:20 PDT 2007

On Mon, 2007-04-09 at 11:30 -0600, Matt Funk wrote:
> The reason i want to run on 32 processor though, is that it takes (on
> 32 procs) several hours till my program crashes. Also, i would like to
> be able to keep the conditions under which it crashes intact as much
> as possible (i.e. run on 32 procs rather than 1).
> 
> Does anyone have any advice? I am open to try out other things as well
> if possible. I am just starting to learn debugger techniques for a
> parallel program.

What you are trying to do isn't uncommon; some of us do it most days.
Having a job which exhibits the problem with only 32 procs and several
hours isn't a bad reproducer; I've certainly seen much worse.  Debugging
at this scale isn't exactly interactive, but it's small enough to be able
to make timely progress.

My advice would be first and foremost to look at the core file; I assume
your program is receiving a SEGV and exiting?  Core files can be
problematic, partly because they aren't always enabled and partly
because to extract anything useful out of them you need to run the
debugger with the same environment the application had.  This isn't
always as easy as it sounds if you are using modules or something like
that.
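
On the "aren't always enabled" point, one thing worth checking is the
core-file size limit of the process.  The sketch below raises the soft
limit to the hard limit from inside the application.  The function name
enable_core_dumps() is made up for illustration, and this only helps if
the hard limit itself is non-zero and the system is configured to write
core files at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit for this process.  Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }

    /* An unprivileged process may raise the soft limit up to the hard
     * limit, but not beyond it. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}
```

Calling this early in main() on every rank avoids depending on how the
job launcher propagates shell limits, which varies between sites.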

Often looking at the stack trace at the point of the crash will give you
a good clue as to where to look, and most of the time further debugging
is a thought process, so no more tools are needed.

If you are having problems getting a stack trace in the normal way there
are two techniques that can be used.  Firstly, you can write a wrapper
script to control the execution of your program.  This can check the exit
status and, if a core file was generated, extract a stack trace from it
automatically and save it to disk somewhere.  This is useful because it
saves time and is also guaranteed to have the same environment as the
application, so it avoids the problem I mentioned above.  The other option
is to catch SEGV in the application and have the signal handler print a
message and spin, allowing you to log in and attach a debugger by hand.
This is often best for complex problems where you want more state than
can be automated, but it does require you to be present at the time of the
crash, which isn't great for reproducers which take several days to run :(
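
A minimal sketch of the second technique (catch SEGV, print a message,
spin) might look like the following.  The names segv_release,
segv_spin_handler() and install_segv_spin_handler() are invented for
this example, and the message is formatted at install time so that the
handler itself only calls functions that are safe inside a signal
handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set to 1 from the debugger to leave the spin loop. */
volatile sig_atomic_t segv_release = 0;

/* Formatted once at install time; the handler only write()s it. */
static char   segv_msg[256];
static size_t segv_msg_len;

static void segv_spin_handler(int sig)
{
    (void)sig;
    if (write(STDERR_FILENO, segv_msg, segv_msg_len) < 0) {
        /* nothing useful can be done about a failed write here */
    }

    /* Spin until someone attaches a debugger and sets segv_release. */
    while (!segv_release)
        sleep(1);

    /* Restore the default action so that, once released, the faulting
     * instruction re-runs and the process dies in the normal way. */
    signal(SIGSEGV, SIG_DFL);
}

/* Hypothetical helper: call once near the start of main().
 * Returns 0 on success, -1 on failure.  The pid is captured here, so
 * the message would be stale in a child created by a later fork(). */
int install_segv_spin_handler(void)
{
    char host[128];
    struct sigaction sa;
    int n;

    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown-host");
    host[sizeof host - 1] = '\0';

    n = snprintf(segv_msg, sizeof segv_msg,
                 "SIGSEGV on %s, pid %ld: spinning, attach with 'gdb -p %ld'\n",
                 host, (long)getpid(), (long)getpid());
    if (n < 0)
        return -1;
    segv_msg_len = (size_t)n < sizeof segv_msg ? (size_t)n
                                               : sizeof segv_msg - 1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_spin_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Once attached, "set var segv_release = 1" in gdb lets the process
continue to its normal death.  Note that the stack may already be
damaged by the time the handler runs, so treat what the debugger shows
with some care.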

printf() isn't nearly as useful in parallel applications as serial ones,
as it's hard to strike the right balance between printing the
information you want and being drowned with information.  Multi-gigabyte
log files are far too easy to generate using this method, although as you
close in on a bug printf does become more useful.  All too often, however,
simply adding printf changes the timing so much that bugs are no longer
reproducible.
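
One way to keep the volume under control, if you do go down the printf
route, is to tag each message with its rank and gate it on a level and
an optional single rank chosen at run time.  This is only a sketch: the
DBG_LEVEL and DBG_RANK environment variables and the dbg_* names are
invented here, and the rank is passed in by the caller (for example
from MPI_Comm_rank) so the code doesn't depend on any particular MPI.
It does nothing about the timing problem.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static int dbg_level = 0;    /* 0 = print nothing */
static int dbg_rank  = -1;   /* -1 = print from every rank */

/* Read the (hypothetical) DBG_LEVEL and DBG_RANK environment
 * variables once at start-up. */
void dbg_init(void)
{
    const char *s;

    if ((s = getenv("DBG_LEVEL")) != NULL)
        dbg_level = atoi(s);
    if ((s = getenv("DBG_RANK")) != NULL)
        dbg_rank = atoi(s);
}

/* Returns 1 if a message at `level` from `rank` would be printed. */
int dbg_enabled(int rank, int level)
{
    return level <= dbg_level && (dbg_rank < 0 || dbg_rank == rank);
}

/* printf-style logging to stderr, prefixed with the calling rank. */
void dbg_printf(int rank, int level, const char *fmt, ...)
{
    va_list ap;

    if (!dbg_enabled(rank, level))
        return;

    fprintf(stderr, "[rank %d] ", rank);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}
```

With this in place you can leave the calls in the code and turn them on
for just the rank that crashes, e.g. with DBG_LEVEL=2 and DBG_RANK=17
in the job environment.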

As someone else mentioned, Valgrind is a very useful tool.  It should run
on most clusters (assuming you are on i686 or x86_64), and if it doesn't,
send a mail to the valgrind-users list and I'm sure someone, quite
possibly me, will help you to get it running.  The downside is that it
will make your code quite a bit slower and increase memory usage, so this
may not be an option for you, but you should definitely try it if a
simple core dump isn't giving you enough information.

Other advice would be to set MALLOC_CHECK_=2 to enable integrity
checking in the libc malloc implementation and, if using ia64, to download
and compile gdb from source; otherwise you might find it's not all that
accurate at times.

TotalView and DDT are both great if you have a licence for either of
them, although I must confess to not having used them for the situation
you describe.

Ashley,