[Beowulf] Load Balance Shifts During Run of Fixed Balance Application [RESOLVED]
Michael H. Frese
Michael.Frese at NumerEx.com
Mon Mar 5 10:38:30 PST 2007
Thanks to those who took the time to consider my original description of
our problem. It has now been resolved and the simulation load balance is
remaining fixed over thousands of time steps.
The problem, not surprisingly, was in our application code, specifically in
our use of MPI in one particular place. We had posted some receives on the
originating processor -- which was also the output processor -- for
messages that were never sent. We failed to detect the error because -- in
another error -- we had failed to call MPI_Waitall on the receive requests
for those messages. The result was that the originating/output processor
had an ever-increasing receive queue to hunt through while pairing up
receives and arriving messages, and so took increasingly longer with each
successive timestep.
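
For anyone who wants to see the effect without an MPI installation, here is
a toy model in Python. It is only a sketch: it assumes the library pairs each
arriving message with a posted receive by searching a single list from the
front, which real MPI implementations may or may not do. The names
run_timesteps and stale_per_step are made up for this illustration and are
not from our code.

```python
def run_timesteps(steps, stale_per_step):
    """Toy model of a posted-receive queue that is searched front to back.

    Each timestep posts `stale_per_step` receives whose messages are never
    sent (and never waited on or cancelled), then posts one real receive and
    matches the message that arrives for it.

    Returns, per timestep, how many queue entries were examined before the
    arriving message found its receive.
    """
    posted = []               # posted receives, oldest first
    examined_per_step = []

    for step in range(steps):
        # Receives for messages that nobody sends: they stay queued forever.
        for i in range(stale_per_step):
            posted.append(("stale", step, i))

        # The one receive that really does get a message this timestep.
        wanted = ("real", step)
        posted.append(wanted)

        # Arriving message: linear search for the matching posted receive.
        examined = 0
        match_index = None
        for index, entry in enumerate(posted):
            examined += 1
            if entry == wanted:
                match_index = index
                break

        posted.pop(match_index)   # matched receive leaves the queue
        examined_per_step.append(examined)

    return examined_per_step


if __name__ == "__main__":
    print(run_timesteps(5, 2))   # two leaked receives per step
    print(run_timesteps(5, 0))   # no leaked receives
```

With two leaked receives per timestep the search cost in this model grows
linearly with the step number; with none it stays constant, which is the
fixed behavior we see now.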
We also sent some messages to processors that did not exist, though I think
this was less of a problem.
We found the problem by looking for one of a related kind. We built and ran a
test code, and found accidentally that failing to post receives caused
processors to have to hunt through an increasing queue of received but
unprocessed messages.
Thanks again.
Mike