[Beowulf] MPI performance on clusters of SMP

Robert G. Brown rgb at phy.duke.edu
Fri Aug 27 06:25:51 PDT 2004


On Thu, 26 Aug 2004, Kozin, I (Igor) wrote:

> Philippe,
> many thanks for your response(s).
> 
> I see. So all the cases I've seen must have the network
> bandwidth saturated (i.e. between a node and the switch).
> Should be possible to profile...

There are a number of tools out there that will permit you to monitor
network load, per interface, per node.  xmlsysd/wulfstat for one, but
also ganglia, various X apps, and command-line tools, e.g.:

  netstat --interface=eth0 5

which is nearly equivalent to:

#!/bin/sh

# Reprint the /proc/net/dev header every 10 samples; sample eth0 every 5 s.
while true
do
  head -2 /proc/net/dev
  COUNT=10
  while [ "$COUNT" -ne 0 ]
  do
    COUNT=$((COUNT - 1))
    grep eth0 /proc/net/dev
    sleep 5
  done
done

The only problem with these last two tools is that they display
absolute packet/byte counts.  It is left as an exercise for the student
to convert this into e.g. perl and add code to extract deltas, divide by
the time, and form a rate.
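
For anyone who wants a head start on that exercise, here is a minimal
sketch, written in Python rather than perl.  It assumes the usual Linux
/proc/net/dev layout (eight receive columns followed by eight transmit
columns after the "iface:" label).  The function names read_counters,
rates and monitor are made up for illustration, and counter wraparound
is not handled.

```python
import time


def read_counters(iface, path="/proc/net/dev"):
    """Return (rx_bytes, rx_packets, tx_bytes, tx_packets) for one interface."""
    with open(path) as f:
        for line in f:
            # Header lines contain no ':'; data lines look like "eth0: 123 4 ..."
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                fields = rest.split()
                # Assumed layout: fields 0-7 are receive, 8-15 are transmit.
                return (int(fields[0]), int(fields[1]),
                        int(fields[8]), int(fields[9]))
    raise ValueError("interface not found: " + iface)


def rates(prev, curr, seconds):
    """Turn two absolute counter samples into per-second rates."""
    return tuple((c - p) / seconds for p, c in zip(prev, curr))


def monitor(iface="eth0", interval=5.0):
    """Print byte and packet rates for iface every `interval` seconds."""
    prev, t_prev = read_counters(iface), time.monotonic()
    while True:
        time.sleep(interval)
        curr, t_curr = read_counters(iface), time.monotonic()
        rx_b, rx_p, tx_b, tx_p = rates(prev, curr, t_curr - t_prev)
        print("%s  rx %.0f B/s (%.0f pkt/s)  tx %.0f B/s (%.0f pkt/s)"
              % (iface, rx_b, rx_p, tx_b, tx_p))
        prev, t_prev = curr, t_curr
```

Calling monitor("eth0", 5.0) loops until interrupted.  Dividing by the
measured elapsed time rather than the nominal sleep keeps the rate
honest when the node is busy and the sleep overruns.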

Or use one of the tools that does it for you, of course...

   rgb

> Thus using both cpus on a node creates even higher load on the 
> connection. Hypothetically, when the memory bandwidth and 
> the switch are not a problem then using N x 2 configuration 
> with 2 network cards per node should be always superior to 
> using 2*N x 1 config with 1 network card per node.
> (same number of cards and cpus!).
> 
> Best,
> Igor
> 
> PS As for my experiment with the Tiger box, it is perfectly 
> reproducible and does not depend on the state of the system.
> I know that the chipset is not perfect and that's why I tried
> to fit everything into cache.
> 
> > 
> > Hi Igor,
> > 
> > the situation is rather complex. You compare an N nodes x 2 cpus
> > machine with a 2*N nodes x 1 cpu machine, but you forget the number
> > of network interfaces. In the first case the 2 cpus share the
> > network interface, and they share the memory too. And of course, in
> > the first case you save money because you have fewer network cards
> > to buy... that's why clusters of 2-cpu boxes are so common.
> > And the 2-cpu boxes can be SMP (Intel) or ccNUMA (Opteron).
> > So it's difficult to predict whether an N nodes x 2 cpus machine
> > performs better than the 2*N nodes x 1 cpu solution for a given
> > program. The best way is to do some tests!
> > For example, an MPI_Alltoall communication pattern should be more
> > effective on a 2*N nodes x 1 cpu machine, but it could be the
> > reverse for an intensive MPI_Isend / MPI_Irecv pattern...
> > 
> > For your Tiger box problem, first you should know that the Intel
> > chipset is not very good; then, are you sure that no other program
> > (like system activity) has interfered with your measurements?
> > 
> > regards,
> > 
> > Philippe Blaise
> > 
> > 
> > Kozin, I (Igor) wrote:
> > 
> > >Nowadays clusters are typically built from SMP boxes.
> > >Dual cpu nodes are common but quad and more are available too.
> > >Nevertheless I never saw that a parallel program runs quicker 
> > >on N nodes x 2 cpus than on 2*N nodes x 1 cpu
> > >even if local memory bandwidth requirements are very modest.
> > >The appearance is such that shared memory communication always
> > >comes at an extra cost rather than as an advantage although
> > >both MPICH and LAM-MPI have support for shared memory.
> > >
> > >Any comments? Is this MPICH/LAM or Linux issue?
> > >
> > >At least in one case I observed a hint towards Linux.
> > >I run several instances of a small program on a 4-way
> > >Itanium2 Tiger box
> > >with 2.4 kernel. The program is basically 
> > >a loop over an array which fits into L1 cache.
> > >Up to 3 instances finish virtually simultaneously.
> > >If 4 instances are launched then 3 finish first and the 4th later
> > >the overall time being about 40% longer.
> > >
> > >Igor

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu