[Beowulf] Programming Help needed

amjad ali amjad11 at gmail.com
Fri Nov 6 14:43:38 PST 2009

Hi all,

I would appreciate some help from those who have experience in
debugging/profiling/tuning parallel scientific codes, especially for

I have parallelized a Fortran CFD code to run on an Ethernet-based Linux
cluster. My MPI communication scheme is as follows:

Suppose the grid/mesh is decomposed across n processors, so that each
processor has a number of elements whose sides/faces are shared with other
processors. I start non-blocking MPI communication for the partition
boundary faces (faces shared between any two processors), and then start
computing values on the internal/non-shared faces. When this computation is
complete, I call WAITALL to ensure the MPI communication has completed. Then
I do the computation on the partition boundary faces (the shared ones). In
this way I try to hide the communication behind computation. Is this
correct?
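
To make clear what I hope to gain, here is a rough timing model of one time
step, written in Python only for illustration (my real code is Fortran/MPI).
The function names and the millisecond figures are made up, and the
overlapped case assumes the transfer really progresses in the background
while the interior faces are computed, which I am not sure holds on my
cluster.

```python
# Idealized timing model for one time step. All numbers are assumptions,
# not measurements from the real code.

def step_time_blocking(t_comm, t_interior, t_boundary):
    """Exchange first, then compute everything: nothing overlaps."""
    return t_comm + t_interior + t_boundary


def step_time_overlapped(t_comm, t_interior, t_boundary):
    """Post non-blocking sends/recvs, compute interior faces, WAITALL,
    then compute the shared faces.

    Assumes the transfer makes progress while the interior work runs,
    so only the longer of the two is paid before the boundary work.
    """
    return max(t_comm, t_interior) + t_boundary


if __name__ == "__main__":
    # Hypothetical per-step costs in milliseconds.
    t_comm, t_interior, t_boundary = 4.0, 10.0, 1.0
    print("blocking  :", step_time_blocking(t_comm, t_interior, t_boundary))
    print("overlapped:", step_time_overlapped(t_comm, t_interior, t_boundary))
```

In this model the communication is hidden completely only when the interior
computation takes at least as long as the exchange; otherwise WAITALL still
has to wait for the remainder.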

IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
elements) with another processor B, then it sends/receives 50 separate
messages. So in general, if a processor has X faces shared with any number
of other processors, it sends/receives X messages. Does this approach
perform much worse than the alternative, in which processor A
sends/receives a single bundled message (containing the data for all 50
faces) to/from processor B? In that case a processor would send/receive
only as many messages as it has neighbouring processors, one bundle/pack
per neighbour.
Is there a significant difference between these two approaches?
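
To illustrate the two approaches, here is a small sketch, again in Python
only for illustration. pack_by_neighbour and transfer_time are hypothetical
helpers, not part of my code or of MPI. The cost estimate is the usual
latency-plus-bandwidth model, and the latency, bandwidth, and bytes-per-face
values are guesses for an Ethernet cluster, not measurements.

```python
from collections import defaultdict


def pack_by_neighbour(shared_faces):
    """Group per-face values into one send buffer per neighbouring rank.

    shared_faces: list of (neighbour_rank, face_id, value) tuples.
    Returns {neighbour_rank: [values ordered by face_id]}.
    Sender and receiver must agree on this ordering to unpack correctly.
    """
    buffers = defaultdict(list)
    for rank, face_id, value in sorted(shared_faces):
        buffers[rank].append(value)
    return dict(buffers)


def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-bandwidth estimate: every message pays the latency once."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed values: 50 shared faces, 40 bytes of data per face,
    # 50 microseconds latency, 100 MB/s effective bandwidth.
    n_faces, bytes_per_face = 50, 40
    latency, bandwidth = 50e-6, 100e6
    total = n_faces * bytes_per_face

    per_face = transfer_time(n_faces, total, latency, bandwidth)
    bundled = transfer_time(1, total, latency, bandwidth)
    print("one message per face:", per_face, "s")
    print("one bundled message :", bundled, "s")
```

With small messages like these, the model is dominated by the per-message
latency, so the number of messages matters far more than the number of
bytes. Whether the real code behaves this way is exactly what I am asking.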
