[Beowulf] Infiniband and multi-cpu configuration
Craig.Tierney at noaa.gov
Mon Feb 11 07:10:36 PST 2008
Guillaume Michal wrote:
> Hi all,
> We set up our first cluster in our faculty this week. As we are new to cluster computing, there is a lot to learn. We performed some Linpack tests using the OpenMPI benchmark available in the Rocks 4.3 distribution. The system is as follows:
> - GigB ethernet with switch HP Procurve 2800 series
> - 1 Master node: 500GB sata HDD, two intel quad core E5410 at 2.33GHz, 2GB mem
> - 4 nodes each having: 80GB sata HDD, two intel quad core E5410 at 2.33GHz, 8GB mem
> First, I'm a bit confused by the parameters P and Q in HPL.dat and how to use them properly. I noticed that a 4P 2Q test is not equivalent to a 2P 4Q one; generally speaking, it does not commute. Why? What exactly are P and Q, then: P for the number of processors per node and Q for the number of nodes?
Visualize the problem as a big 2D matrix. P and Q represent how the problem
is divided: the matrix is distributed over a P x Q grid of processes, so they
are not "processors per node" and "number of nodes". In general, the best case
is when the matrix is divided into even squares (P equal to Q). If your core
count isn't n^2, then P and Q have to be different. From experience, P should
always be less than Q. There may be a computational reason for that
(i.e., longer strides in memory), but I am not sure.
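
As a rough illustration of the rule of thumb above, here is a minimal Python
sketch that lists the possible P x Q grids for a given process count and picks
the most nearly square one with P <= Q. The function names are made up for
this example and are not part of HPL.

```python
def process_grids(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    return grids


def most_square_grid(nprocs):
    """Pick the pair with the largest P that still satisfies P <= Q."""
    return process_grids(nprocs)[-1]


print(process_grids(32))     # [(1, 32), (2, 16), (4, 8)]
print(most_square_grid(32))  # (4, 8)
```

For 32 processes this suggests trying P=4 Q=8 before P=8 Q=4.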
> Secondly, what is the definition of processor for a quad core architecture? I suppose a quad core should be counted as 4 processors.
Yes, unless you are using a multithreaded BLAS library. If you are,
you should run one process per node.
> I launched Linpack using Ns=10000 and various configurations for P and Q. At the moment I get a maximum of 78 Gflops using P=8 Q=4 -> 32 processors.
You want to use as much available memory as possible. I use N=10000 on a
single processor, single core run with 1GB. You can figure out a good
value of N by the following formula:
Ns=sqrt(<Memory in Bytes per core>*<Number of cores>/8)
The 8 represents the size of a double. For <Memory in Bytes per core>, I try
to use the largest number possible, typically about 90% of max. You never
want to go into swap during these calculations (or, have it crash because
you have diskless nodes).
Ex: If you have 2GB per core for 32p, should use Ns as:
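
The formula can be evaluated with a short Python sketch. It assumes that
"2GB" means 2 * 1024^3 bytes; the function name and the `fraction` parameter
(for the "about 90%" headroom mentioned above) are only illustrative.

```python
import math

BYTES_PER_DOUBLE = 8
GIB = 1024 ** 3  # assumption: 1 GB is taken as 2**30 bytes


def problem_size(mem_bytes_per_core, ncores, fraction=1.0):
    """Ns = sqrt(memory per core * cores * fraction / 8), rounded down."""
    usable_bytes = mem_bytes_per_core * ncores * fraction
    return int(math.sqrt(usable_bytes / BYTES_PER_DOUBLE))


print(problem_size(2 * GIB, 32))                # all of the memory
print(problem_size(2 * GIB, 32, fraction=0.9))  # about 90% of it
```

Under those assumptions this gives an Ns of roughly 92,000 when using all of
the memory, or roughly 88,000 at 90%.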
Honestly, this may be overkill. At some point, the working memory set will
be large enough so that FP performance will be the bottleneck. I would
start with smaller numbers (say half) and work your way up to understand
what is going on. In any case, using Ns=10000 is way too small.
> If I'm right the peak performance should be Rpeak = 4 cores x 4 floating point ops per cycle x 2.33 GHz x 8 quad cores = 298 Gflops.
> Which would lead to a test running at ~25% Rpeak.
> This is very low and I see 3 causes for the problem:
> - I miscalculated Rpeak
> - P and Q are not set properly
> - there is a serious bottleneck
I think your Rpeak calculation is correct (not sure how many FPs the latest
Intel chips can do).
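
To check the arithmetic, here is a small Python sketch using the numbers from
the question. The 4 floating point ops per cycle is the poster's figure and is
taken as given here; the function name is only illustrative.

```python
def rpeak_gflops(sockets, cores_per_socket, flops_per_cycle, clock_ghz):
    """Theoretical peak in Gflops: total cores * flops per cycle * clock in GHz."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz


# 4 compute nodes with two quad-core CPUs each = 8 sockets (from the question)
rpeak = rpeak_gflops(sockets=8, cores_per_socket=4,
                     flops_per_cycle=4, clock_ghz=2.33)
measured = 78.0  # Gflops reported for P=8 Q=4

print(f"Rpeak      = {rpeak:.1f} Gflops")
print(f"Efficiency = {measured / rpeak:.1%}")
```

That works out to about 298 Gflops, so the 78 Gflops run is at roughly 26% of
peak.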
If increasing Ns doesn't help, run smaller cases on a per-node basis (using
all available memory for each node). If you don't get the same result
on every node (or at least within 2%), you have a problem. Figure out
what is wrong with the slow nodes. Also, run the test multiple times
on the same node and verify consistent performance.
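
One way to apply the 2% check is sketched below in Python. The node names and
Gflops values are invented purely for illustration (they are not measurements
from this cluster), and comparing against the fastest node is just one
possible choice of reference.

```python
def flag_slow_nodes(gflops_by_node, tolerance=0.02):
    """Return nodes more than `tolerance` below the fastest node, sorted by name."""
    best = max(gflops_by_node.values())
    threshold = best * (1.0 - tolerance)
    return sorted(name for name, gflops in gflops_by_node.items()
                  if gflops < threshold)


# Hypothetical per-node results, made up for this example
sample = {
    "compute-0-0": 60.1,
    "compute-0-1": 59.8,
    "compute-0-2": 52.3,
    "compute-0-3": 60.0,
}

print(flag_slow_nodes(sample))  # ['compute-0-2']
```

The same function can be fed repeated runs from a single node to check that
its performance is consistent from run to run.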
> Thanks for your advice
Craig Tierney (craig.tierney at noaa.gov)