[Beowulf] MPI performance on clusters of SMP
Robert G. Brown
rgb at phy.duke.edu
Fri Aug 27 06:25:51 PDT 2004
On Thu, 26 Aug 2004, Kozin, I (Igor) wrote:
> many thanks for your response(s).
> I see. So all the cases I've seen must have the network
> bandwidth saturated (i.e. between a node and the switch).
> Should be possible to profile...
There are a number of tools out there that will permit you to monitor
network load, per interface, per node: xmlsysd/wulfstat for one, but
also ganglia, various X apps, and the command line, e.g.

    netstat --interface=eth0 5

which is nearly equivalent to:
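
    while true; do
        # reprint the column headers every 20 samples (20 is an arbitrary choice)
        head -2 /proc/net/dev
        COUNT=20
        while [ $COUNT != 0 ]; do
            COUNT=`expr $COUNT - 1`
            grep eth0 /proc/net/dev
            sleep 5
        done
    done
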
The only problem with these last two tools is that they display
absolute packet/byte counts. It is left as an exercise for the student
to convert this into e.g. perl and add code to extract deltas, divide by
the time, and form a rate.
Or use one of the tools that does it for you, of course...
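For anyone who does want to do the exercise, here is a minimal sketch in
Python rather than perl. It assumes the usual Linux /proc/net/dev layout
(after the "iface:" label, fields 1 and 2 are received bytes and packets,
fields 9 and 10 are transmitted bytes and packets). The function names
read_counters, rates, and monitor are made up for this illustration, and
counter wraparound is not handled.

```python
import time


def read_counters(text, iface):
    """Return (rx_bytes, rx_packets, tx_bytes, tx_packets) for iface.

    `text` is the full contents of /proc/net/dev. The two header lines
    contain no colon, so they are skipped automatically.
    """
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Assumed layout: 8 receive columns followed by 8 transmit columns.
            return (int(fields[0]), int(fields[1]),
                    int(fields[8]), int(fields[9]))
    raise KeyError(iface)


def rates(prev, cur, seconds):
    """Turn two absolute snapshots into per-second rates (delta / time)."""
    return tuple((c - p) / seconds for p, c in zip(prev, cur))


def monitor(iface="eth0", interval=5.0, path="/proc/net/dev"):
    """Print byte and packet rates for iface every `interval` seconds."""
    with open(path) as f:
        prev = read_counters(f.read(), iface)
    t_prev = time.monotonic()
    while True:
        time.sleep(interval)
        with open(path) as f:
            cur = read_counters(f.read(), iface)
        t_cur = time.monotonic()
        # Divide by the measured elapsed time, not the nominal interval.
        rx_b, rx_p, tx_b, tx_p = rates(prev, cur, t_cur - t_prev)
        print("%s  rx %.0f B/s %.0f pkt/s  tx %.0f B/s %.0f pkt/s"
              % (iface, rx_b, rx_p, tx_b, tx_p))
        prev, t_prev = cur, t_cur
```

Calling monitor("eth0", 5) gives roughly the same view as the loop above,
but as rates instead of running totals. Comparing the byte rate against
the nominal wire speed of the interface is then a quick check on whether
the link between node and switch is anywhere near saturated.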
> Thus using both cpus on a node creates even higher load on the
> connection. Hypothetically, when the memory bandwidth and
> the switch are not a problem then using N x 2 configuration
> with 2 network cards per node should be always superior to
> using 2*N x 1 config with 1 network card per node.
> (same number of cards and cpus!).
> PS As for my experiment with the Tiger box, it is perfectly
> reproducible and does not depend on the state of the system.
> I know that the chipset is not perfect and that's why I tried
> to fit everything in to cache.
> > Hi Igor,
> > the situation is rather complex. You compare an N nodes x 2 cpus machine
> > with a 2*N nodes x 1 cpu machine, but you forget the number of network
> > interfaces. In the first case the 2 cpus share the network interface,
> > and they share the memory too. And of course, in the first case you
> > save money because you have fewer network cards to buy... that's why
> > clusters of 2-cpu boxes are so common.
> > And the 2-cpu boxes can be smp (intel) or ccnuma (opteron).
> > So it is difficult to predict whether an N nodes x 2 cpus machine
> > performs better than the 2*N nodes x 1 cpu solution for a given
> > program. The best way is to do some tests!
> > For example, an MPI_Alltoall communication pattern should be more
> > efficient on a 2*N nodes x 1 cpu machine, but it could be the reverse
> > for an intensive MPI_Isend / MPI_Irecv pattern...
> > For your tiger box problem, first you should know that the intel
> > chipset is not very good; then, are you sure that no other program
> > (like system activity) has interfered with your measurements?
> > regards,
> > Philippe Blaise
> > Kozin, I (Igor) wrote:
> > >Nowadays clusters are typically built from SMP boxes.
> > >Dual cpu nodes are common but quad and more available too.
> > >Nevertheless I never saw that a parallel program runs quicker
> > >on N nodes x 2 cpus than on 2*N nodes x 1 cpu
> > >even if local memory bandwidth requirements are very modest.
> > >The appearance is such that shared memory communication always
> > >comes at an extra cost rather than as an advantage although
> > >both MPICH and LAM-MPI have support for shared memory.
> > >
> > >Any comments? Is this MPICH/LAM or Linux issue?
> > >
> > >At least in one case I observed a hint towards Linux.
> > >I run several instances of a small program on a 4-way Itanium2
> > >Tiger box with 2.4 kernel. The program is basically
> > >a loop over an array which fits into L1 cache.
> > >Up to 3 instances finish virtually simultaneously.
> > >If 4 instances are launched then 3 finish first and the 4th later
> > >the overall time being about 40% longer.
> > >
> > >Igor
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu