[Beowulf] MPI performance on clusters of SMP
Kozin, I (Igor)
I.Kozin at dl.ac.uk
Thu Aug 26 09:22:19 PDT 2004
Nowadays clusters are typically built from SMP boxes.
Dual-CPU nodes are common, and quad or larger nodes are available too.
Nevertheless, I have never seen a parallel program run faster
on N nodes x 2 CPUs than on 2*N nodes x 1 CPU,
even when its local memory bandwidth requirements are very modest.
The impression is that shared-memory communication always
comes at an extra cost rather than as an advantage, although
both MPICH and LAM/MPI have support for shared memory.
Any comments? Is this an MPICH/LAM issue or a Linux issue?
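For reference, the two placements I am comparing can be selected through the
machinefile. This is only a sketch: the hostnames are made up, and the
`host:n` machinefile syntax assumed here is MPICH 1.x style (LAM's lamboot
schema differs).

```shell
# Hypothetical 8-rank job, two placements (hostnames are illustrative).

# (a) 2 ranks per node on 4 dual-CPU nodes:
cat > machines.smp <<EOF
node01:2
node02:2
node03:2
node04:2
EOF
mpirun -np 8 -machinefile machines.smp ./myprog

# (b) 1 rank per node on 8 nodes:
cat > machines.flat <<EOF
node01
node02
node03
node04
node05
node06
node07
node08
EOF
mpirun -np 8 -machinefile machines.flat ./myprog
```

In case (a) half of the point-to-point traffic can go through shared memory;
in case (b) everything goes over the interconnect, yet (b) is what I see
winning.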
In at least one case I observed a hint that the OS is to blame.
I experimented with running several instances of a small program on
a 4-way Itanium2 Tiger box with a 2.4 kernel. The program is
basically a loop over an array which fits into the L1 cache.
Up to 3 instances finish virtually simultaneously.
If 4 instances are launched, 3 finish first and the 4th later,
with the overall time being about 40% longer.