[Beowulf] MPI performance on clusters of SMP

Thu Aug 26 11:22:04 PDT 2004

Philippe,
many thanks for your response(s).

I see. So in all the cases I've seen, the network link between
a node and the switch must have been saturated. That should be
possible to confirm by profiling. Using both cpus on a node then
puts an even higher load on that link. Hypothetically, when
neither the memory bandwidth nor the switch is a bottleneck,
an N x 2 configuration with 2 network cards per node should
always be superior to a 2*N x 1 configuration with 1 network
card per node (same number of cards and cpus!).
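
A rough back-of-the-envelope sketch of that argument, in Python. It is
only a toy model under assumptions that are mine, not measurements:
the node-to-switch links are the only bottleneck, every process sends
an equal-sized message to every other process (an all-to-all), traffic
between two cpus on the same node costs nothing, and multiple cards in
a node add their bandwidth perfectly. The function name and parameters
are made up for illustration.

```python
def offnode_time(n_nodes, cpus_per_node, nics_per_node, msg_bytes, link_bw):
    """Estimated time (s) for one all-to-all exchange when only the
    node-to-switch links limit throughput (toy model, see assumptions)."""
    procs = n_nodes * cpus_per_node
    # Peers on the same node are reached through shared memory, so only
    # peers on other nodes generate traffic on the link.
    offnode_peers = procs - cpus_per_node
    bytes_per_node = cpus_per_node * offnode_peers * msg_bytes
    # Assumption: the cards of one node share the outgoing load evenly.
    return bytes_per_node / (nics_per_node * link_bw)


if __name__ == "__main__":
    n = 8                      # hypothetical cluster size
    msg = 1_000_000            # bytes per message (assumed)
    bw = 100_000_000           # bytes/s per link (assumed)
    print("N x 2, 1 card :", offnode_time(n, 2, 1, msg, bw))
    print("N x 2, 2 cards:", offnode_time(n, 2, 2, msg, bw))
    print("2N x 1, 1 card:", offnode_time(2 * n, 1, 1, msg, bw))
```

In this model the N x 2 setup with 2 cards comes out slightly ahead of
2*N x 1 because one peer per process is local, while N x 2 with a
single card is roughly twice as slow. Real codes will of course also
hit memory and switch limits that the model ignores.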

Best,
Igor

PS As for my experiment with the Tiger box, it is perfectly 
reproducible and does not depend on the state of the system.
I know that the chipset is not perfect and that's why I tried
to fit everything into cache.

> 
> Hi Igor,
> 
> the situation is rather complex. You compare a N nodes x 2 cpus
> machine with a 2 * N nodes x 1 cpu machine, but you forget the
> number of network interfaces. In the first case the 2 cpus share
> the network interface and they share the memory too. And of course,
> in the first case, you save money because you have fewer network
> cards to buy... that's why clusters with 2 cpus boxes are so common.
> And the 2 cpus boxes can be smp (intel) or ccnuma (opteron).
> Then, it's difficult to predict if a N nodes x 2 cpus machine
> performance is better than the 2 N * 1 cpu solution for a given
> program. The better way is to do some tests !
> For example, a MPI_Alltoall communication pattern should be more
> effective on a 2 N * 1 cpu machine, but it could be the inverse
> situation for an intensive MPI_Isend / MPI_Irecv pattern...
> 
> For your tiger box problem, first you should know that the intel
> chipset is not very good, then are you sure that no other program
> (like system activity) has interfered with your measurements ?
> 
> regards,
> 
> Philippe Blaise
> 
> 
> Kozin, I (Igor) wrote:
> 
> >Nowadays clusters are typically built from SMP boxes.
> >Dual cpu nodes are common but quad and more are available too.
> >Nevertheless I never saw that a parallel program runs quicker 
> >on N nodes x 2 cpus than on 2*N nodes x 1 cpu
> >even if local memory bandwidth requirements are very modest.
> >The appearance is such that shared memory communication always
> >comes at an extra cost rather than as an advantage although
> >both MPICH and LAM-MPI have support for shared memory.
> >
> >Any comments? Is this MPICH/LAM or Linux issue?
> >
> >At least in one case I observed a hint towards Linux.
> >I run several instances of a small program on a 4-way Itanium2
> >Tiger box with 2.4 kernel. The program is basically
> >a loop over an array which fits into L1 cache.
> >Up to 3 instances finish virtually simultaneously.
> >If 4 instances are launched then 3 finish first and the 4th later
> >the overall time being about 40% longer.
> >
> >Igor
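
For anyone who wants to try the same kind of measurement, here is a
minimal sketch of the method in Python: start several copies of a
small loop at once and record when each one finishes. It is not the
original test program. The workload string, the function name and the
repeat count are placeholders, and because of interpreter overhead a
Python loop is not a true L1-resident kernel; a compiled loop would be
needed to reproduce the actual effect.

```python
import subprocess
import sys
import time

# Hypothetical stand-in workload: repeated sums over a small array.
WORKLOAD = "a = list(range(1024))\nfor _ in range(%d):\n    s = sum(a)\n"


def run_instances(count, repeats=20000):
    """Start `count` copies of the loop together and return the
    wall-clock finish time (s) of each copy, in ascending order."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKLOAD % repeats])
        for _ in range(count)
    ]
    finish = [None] * count
    # Poll until every copy has exited, noting when each one is done.
    while any(t is None for t in finish):
        for i, p in enumerate(procs):
            if finish[i] is None and p.poll() is not None:
                finish[i] = time.perf_counter() - start
        time.sleep(0.001)
    return sorted(finish)


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, ["%.3f" % t for t in run_instances(k)])
```

If one copy consistently finishes well after the others only when the
count equals the number of cpus, that points at scheduling or system
activity rather than at the program itself.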