[Beowulf] Programming Help needed

Joshua mora acosta joshua_mora at usa.net
Fri Nov 6 15:29:58 PST 2009

Just try it and you'll understand what communication overhead means.
Most of these apps are dominated by network latency: the messages are small
but there are lots of them, because i) many neighbor processors are involved
and ii) the process is iterative.
Packing all the faces that need to be exchanged is the right way to go.
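
To put rough numbers on the packing argument, here is a minimal sketch using a
simple latency + bandwidth cost model. The latency and bandwidth values, the
face sizes, and the helper names (pack_by_neighbor, exchange_time) are all
assumptions made up for illustration, not measurements from any real cluster.

```python
from collections import defaultdict

# Illustrative numbers only (assumed, not measured): rough figures
# for an Ethernet link.
LATENCY_S = 50e-6        # assumed per-message latency, in seconds
BANDWIDTH_BPS = 125e6    # assumed bandwidth, in bytes per second


def pack_by_neighbor(shared_faces):
    """Group face values by neighbor rank: one send buffer per neighbor.

    shared_faces is a list of (neighbor_rank, face_values) pairs
    (a hypothetical layout chosen for this sketch).
    """
    buffers = defaultdict(list)
    for neighbor, values in shared_faces:
        buffers[neighbor].extend(values)
    return dict(buffers)


def exchange_time(n_messages, total_bytes,
                  latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Cost model: every message pays the latency, every byte the bandwidth."""
    return n_messages * latency + total_bytes / bandwidth


# 50 faces shared with rank 1, each carrying 5 doubles (assumed sizes).
faces = [(1, [float(i)] * 5) for i in range(50)]
buffers = pack_by_neighbor(faces)

total_bytes = sum(len(v) for _, v in faces) * 8   # 8 bytes per double
t_unpacked = exchange_time(len(faces), total_bytes)    # one message per face
t_packed = exchange_time(len(buffers), total_bytes)    # one message per neighbor

print(f"unpacked: {len(faces)} messages, {t_unpacked * 1e6:.0f} us")
print(f"packed:   {len(buffers)} message(s), {t_packed * 1e6:.0f} us")
```

Under these assumed numbers the per-face exchange costs about 2.5 ms and the
packed one about 66 us, because the same bytes are moved but the latency is
paid once instead of 50 times. Real figures depend on the network and the MPI
implementation, so measure on your own cluster.
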
You can also think about having a dedicated thread for handling the
communications and the remaining ones for computation at the compute node
level. That way you really get good overlapping of computation and
communication.
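
The dedicated-thread idea can be sketched roughly as below. This only shows
the structure: exchange_halo is a hypothetical stand-in that sleeps and copies
data instead of doing real MPI calls, and the compute functions are toy
placeholders. Python threads are used just to keep the sketch short; in a real
MPI code you would also need to request a suitable thread support level
through MPI_Init_thread.

```python
import threading
import time


def exchange_halo(send_buf, recv_buf):
    """Stand-in for the real exchange (e.g. packed MPI sends/receives)."""
    time.sleep(0.01)             # pretend network delay
    recv_buf.extend(send_buf)    # pretend the neighbor sent back our values


def compute_interior(cells):
    """Toy update that needs no data from other processors."""
    return [2.0 * c for c in cells]


def compute_boundary(cells, halo):
    """Toy update that needs the received halo values."""
    return [c + h for c, h in zip(cells, halo)]


interior = [1.0, 2.0, 3.0]
boundary = [10.0, 20.0]
halo = []

# Start the communication thread, then compute on the interior meanwhile.
comm = threading.Thread(target=exchange_halo, args=(boundary, halo))
comm.start()
interior_new = compute_interior(interior)

# Wait for the exchange to finish (plays the role of WAITALL),
# then do the work that depends on the received data.
comm.join()
boundary_new = compute_boundary(boundary, halo)

print(interior_new, boundary_new)
```

The ordering is the point here: start the exchange, do the work that does not
depend on it, wait, then do the work that does.
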

------ Original Message ------
Received: 04:52 PM CST, 11/06/2009
From: amjad ali <amjad11 at gmail.com>
To: Beowulf Mailing List <beowulf at beowulf.org>
Subject: [Beowulf] Programming Help needed

> Hi all,
> I need/request some help from those who have some experience in
> debugging/profiling/tuning parallel scientific codes, especially for
> I have parallelized a Fortran CFD code to run on an Ethernet-based Linux
> cluster. Regarding MPI communication, what I do is as follows.
> Suppose that the grid/mesh is decomposed for n processors, such
> that each processor has a number of elements that share their side/face
> with different processors. I start non-blocking MPI
> communication at the partition boundary faces (faces shared between any two
> processors), and then start computing values on the internal/non-shared
> faces. When I complete this computation, I call WAITALL to ensure that the
> MPI communication has completed. Then I do the computation on the partition
> boundary faces (the shared ones). This way I try to hide the communication
> behind computation. Is this correct?
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less
> with another processor B, then it sends/recvs 50 different messages. So in
> general, if a processor shares X faces with any number of
> other processors, it sends/recvs that many messages. Does this approach have
> "very much reduced" performance compared with the alternative in which
> processor A sends/recvs a single bundled message (containing the data for
> all 50 faces) to processor B? That would mean that in general a processor
> sends/recvs only as many messages as it has neighbouring processors: it
> sends one bundle/pack of face data to each neighbouring processor.
> Is there "quite a big difference" between these two approaches?
