[Beowulf] Infiniband and multi-cpu configuration

Mon Feb 11 07:10:36 PST 2008

Guillaume Michal wrote:
 > Hi all,
 > We set up our first cluster in our faculty this week. As we are new to cluster computing, there is a lot to learn. We performed some Linpack tests using the OpenMPI benchmark available in the Rocks 4.3 distribution. The system is as follows:
 >  - Gigabit Ethernet with an HP ProCurve 2800 series switch
 >  - 1 master node: 500GB SATA HDD, two Intel quad-core E5410 at 2.33GHz, 2GB mem
 >  - 4 nodes, each having: 80GB SATA HDD, two Intel quad-core E5410 at 2.33GHz, 8GB mem
 >
 > First, I'm a bit confused by the parameters P and Q in HPL.dat and how to use them properly. I noticed a 4P 2Q test is not equivalent to a 2P 4Q; generally speaking, it does not commute. Why? What exactly are P and Q then: P for the number of processors per node and Q for the number of nodes?
 >

Visualize the problem as a big 2d matrix.  P and Q represent how the problem
is divided among the processes: a grid of P rows by Q columns, not processors
per node and number of nodes.  In general, the best is when the grid is as
close to square as possible.  If your core count isn't n^2, then P and Q have
to be different.  From experience, P should always be less than Q.  There may
be a computational reason for that (i.e., longer strides in memory), but I am
not sure.
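
As a rough illustration of the rule of thumb above (a sketch only; the helper
name grid_candidates is made up for this example and is not part of HPL), the
possible P x Q grids for a given process count can be listed with P <= Q,
closest to square first:

```python
import math

def grid_candidates(nprocs):
    """List (P, Q) pairs with P * Q == nprocs and P <= Q, most square first."""
    pairs = [(p, nprocs // p)
             for p in range(1, math.isqrt(nprocs) + 1)
             if nprocs % p == 0]
    # The smaller Q - P is, the closer the process grid is to square.
    return sorted(pairs, key=lambda pq: pq[1] - pq[0])

# 32 processes, as in the run described in the question.
print(grid_candidates(32))  # [(4, 8), (2, 16), (1, 32)]
```

For 32 processes this suggests trying P=4 Q=8 first, rather than P=8 Q=4.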

 > Secondly, what is the definition of processor for a quad core architecture? I suppose a quad core should be counted as 4 processors.

Yes, unless you are using a multithreaded BLAS library.  If you are,
you should run one process per node.

 >
 > I launched Linpack using Ns=10000 and various configurations for P and Q. At the moment I get a maximum of 78 Gflops using P=8 Q=4 -> 32 processors.

You want to use as much available memory as possible.  I use N=10000 on a
single processor, single core run with 1GB.   You can figure out a good
value of N by the following formula:

Ns=sqrt(<Memory in Bytes per core>*<Number of cores>/8)

The 8 represents the size of a double.  For <Memory in Bytes per core>, I try
to use the largest number possible, typically about 90% of max.  You never
want to go into swap during these calculations (or, have it crash because
you have diskless nodes).

Ex: If you have 2GB per core on 32 processes and use about 1900MB of it per
core, you should use Ns as:

Ns=sqrt(1900*1024*1024*32/8)
Ns=89270

Honestly, this may be overkill.  At some point, the working memory set will
be large enough so that FP performance will be the bottleneck.  I would
start with smaller numbers (say half) and work your way up to understand
what is going on.  In any case, using Ns=10000 is way too small.
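
The sizing formula above can be sketched in a few lines of Python.  This is
only an illustration: the function name hpl_problem_size is made up, and the
last case assumes about 90% of 1GB per core is usable on the 8GB, 8-core
nodes described in the question.

```python
import math

def hpl_problem_size(bytes_per_core, cores):
    """Largest N such that an N x N matrix of 8-byte doubles fits in memory."""
    return math.isqrt(bytes_per_core * cores // 8)

# The worked example above: about 1900MB usable per core, 32 cores.
n_max = hpl_problem_size(1900 * 1024 * 1024, 32)
print(n_max)       # 89270
print(n_max // 2)  # 44635, a smaller starting point (about half)

# Assumption: ~90% of 1GB per core on the 8GB, 8-core compute nodes.
print(hpl_problem_size(int(0.9 * 1024**3), 32))
```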

 >
 > If I'm right the peak performance should be Rpeak = 4 cores x 4 floating point ops per cycle x 2.33 GHz x 8 quad cores = 298 Gflops.
 > Which would lead to a test running at ~25% of Rpeak.
 >
 > This is very low and I see 3 causes for the problem:
 >     - I miscalculated Rpeak
 >     - P and Q are not set properly
 >     - there is a serious bottleneck
 >

I think your Rpeak calculation is correct (though I'm not sure how many
floating point operations per cycle the latest Intel chips can do).
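
For reference, the arithmetic in the question can be checked with a short
sketch.  The function name rpeak_gflops is made up for this example, and the
4 floating point ops per cycle is the figure from the question, not something
verified here.

```python
def rpeak_gflops(cores_per_chip, flops_per_cycle, clock_ghz, chips):
    """Theoretical peak in Gflops: cores x flops/cycle x GHz x chips."""
    return cores_per_chip * flops_per_cycle * clock_ghz * chips

# Figures from the question: 8 quad-core chips at 2.33 GHz across 4 nodes.
rpeak = rpeak_gflops(4, 4, 2.33, 8)
efficiency = 78 / rpeak  # 78 Gflops was the best measured result

print(f"Rpeak = {rpeak:.1f} Gflops, efficiency = {efficiency:.0%}")
# Rpeak = 298.2 Gflops, efficiency = 26%
```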

If increasing Ns doesn't help, run smaller cases on a per-node basis (using
all available memory for each node).  If you don't get the same performance
on every node (or at least within 2%), you have a problem.  Figure out
what is wrong with the slow nodes.  Also, run the test multiple times
on the same node and verify consistent performance.
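
The per-node comparison could be automated along these lines.  This is a
sketch under assumptions: the function name slow_nodes, the node names, and
the Gflops values are all made-up placeholders, and comparing each node
against the fastest one is just one way to apply the 2% rule.

```python
def slow_nodes(gflops_by_node, tolerance=0.02):
    """Return nodes more than `tolerance` below the fastest node."""
    best = max(gflops_by_node.values())
    return sorted(name for name, gflops in gflops_by_node.items()
                  if gflops < best * (1 - tolerance))

# Placeholder numbers for illustration only, not measured results.
results = {
    "compute-0-0": 20.0,
    "compute-0-1": 19.8,
    "compute-0-2": 17.5,
    "compute-0-3": 20.1,
}
print(slow_nodes(results))  # ['compute-0-2']
```

The same function can be used on repeated runs from one node by keying the
dictionary on run number instead of node name.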

Craig

 > Thanks for your advice
 >
 > Guillaume

-- 
Craig Tierney (craig.tierney at noaa.gov)