[Beowulf] Re: Performance characterising a HPC application
Ashley Pittman
ashley at quadrics.com
Wed Apr 4 08:27:56 PDT 2007
stephen mulcahy wrote:
> Hi,
>
> As a follow on to my previous mail, I've gone ahead and run the Intel
> MPI Benchmarks (v3.0) on this cluster and gotten the following results -
> I'd be curious to know how they compare to other similar clusters.
You don't say what hardware you are using, but one of your original
mails describes it as "a 20 node opteron 270 (2.0GHz dual core, 4GB ram,
diskless) cluster with gigabit ethernet interconnect".
You should be aware that the PingPong and PingPing tests in IMB ping
messages between processes 0 and 1 in the job; it's likely that these
are on the same node, so you are measuring shared-memory performance.
Given that it reports a bandwidth figure of 607MB/s I doubt very much
it's going over the ethernet. You should re-run these two tests using
one processor per node over two nodes to get the network speeds; you
can then use the ratio of nodes to CPUs in a typical job to decide how
much weight to attach to the different results.
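If it helps, a couple of lines of MPI will confirm where the launcher
is actually putting ranks 0 and 1; this is just a rough sketch, not
part of IMB:

/* placement.c - print which host each MPI rank is running on.
   Compile with your MPI's C wrapper, e.g. mpicc placement.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* If ranks 0 and 1 print the same hostname then PingPong/PingPing
       are exercising shared memory, not the gigabit network. */
    printf("rank %d running on %s\n", rank, host);

    MPI_Finalize();
    return 0;
}

Launch it the same way you launch IMB and the placement should match.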
None of the figures you have posted strike me as being that great; I've
commented on some of the individual ones below.
> Also, I'm trying to determine which parts of the IMB results are most
> important for me
It depends on the application and which MPI calls it uses. PingPong is
probably the most important.
> my understanding is that PingPong is a good measure
> of overall latency and bandwidth between individual nodes in the cluster.
Yes.
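For reference, the t[usec] and Mbytes/sec columns come from a loop
along these lines: half the round-trip time for a message, and the
message size divided by that time (with Mbytes meaning 2^20 bytes).
A minimal sketch, with the message size and repetition count picked
arbitrarily:

/* pingpong.c - minimal version of what IMB's PingPong measures.
   Run with exactly 2 ranks, ideally one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (4*1024*1024)   /* message size, arbitrary example */
#define REPS   100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    /* One-way time in seconds: half the average round trip. */
    double t = (MPI_Wtime() - t0) / REPS / 2.0;

    if (rank == 0)
        printf("%d bytes: %.2f usec, %.2f MBytes/sec\n",
               NBYTES, t * 1e6, (NBYTES / 1048576.0) / t);

    free(buf);
    MPI_Finalize();
    return 0;
}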
> Am I correct in thinking that Bcast and Reduce are good indicators of
> the performance of the cluster in terms of sending and receiving data
> from the head node to the compute nodes? My guess is that the other
> benchmarks are not as relevant to me since they measure performance for
> various types of inter-node traffic rather than the one-to-many pattern
> exhibited by my application.
It depends. If you are using one-to-many patterns in the app then code
it to use Bcast, creating the requisite sub-communicators as needed;
the Bcast benchmark is then worth looking at.
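For what it's worth, the skeleton is only a few lines; this is just a
sketch, with the choice of which ranks go in the sub-communicator (the
colour) made up for illustration:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double data[1024] = { 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Made-up rule: the even ranks form the one-to-many group. */
    int colour = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &sub);

    if (sub != MPI_COMM_NULL) {
        /* Rank 0 of the sub-communicator acts as the "head" root. */
        MPI_Bcast(data, 1024, MPI_DOUBLE, 0, sub);
        MPI_Comm_free(&sub);
    }

    MPI_Finalize();
    return 0;
}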
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
> #---------------------------------------------------
> #---------------------------------------------------
> # Benchmarking PingPong
> # #processes = 2
> # ( 78 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
> #bytes #repetitions t[usec] Mbytes/sec
> 0 1000 27.63 0.00
> 1 1000 28.56 0.03
> 2 1000 28.65 0.07
> 4 1000 28.72 0.13
> 8 1000 27.03 0.28
> 16 1000 28.63 0.53
> 32 1000 28.79 1.06
> 64 1000 28.72 2.13
> 128 1000 28.61 4.27
> 256 1000 28.75 8.49
> 512 1000 27.90 17.50
> 1024 1000 27.55 35.45
> 2048 1000 29.70 65.77
> 4096 1000 86.92 44.94
> 8192 1000 87.85 88.93
> 16384 1000 91.98 169.88
> 32768 1000 105.01 297.60
> 65536 640 149.88 417.00
> 131072 320 312.52 399.98
> 262144 160 547.92 456.27
> 524288 80 998.77 500.62
> 1048576 40 2008.35 497.92
> 2097152 20 3407.78 586.89
> 4194304 10 6583.70 607.56
*Assuming* this is for shared memory, these are pretty dreadful
figures. For reference, we get the following when we run the test
intra-node; as you can see, they are strikingly different.
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.05 0.00
1 1000 1.03 0.92
2 1000 1.03 1.86
4 1000 1.09 3.49
8 1000 1.05 7.28
16 1000 1.03 14.83
32 1000 1.05 29.08
64 1000 1.16 52.61
128 1000 1.27 96.34
256 1000 1.44 169.75
512 1000 1.76 278.16
1024 1000 2.36 414.15
2048 1000 3.51 556.56
4096 1000 5.25 744.18
8192 1000 5.89 1326.33
16384 1000 7.85 1989.66
32768 1000 12.77 2447.58
65536 640 23.85 2620.20
131072 320 44.14 2831.98
262144 160 85.17 2935.16
524288 80 359.69 1390.10
1048576 40 1207.04 828.47
2097152 20 2407.75 830.65
4194304 10 4792.27 834.68
> #---------------------------------------------------
> # Benchmarking PingPing
> # #processes = 2
> # ( 78 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
As above really.
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter
> # #processes = 80
> #----------------------------------------------------------------
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> 0 1000 1.15 1.17 1.15
> 4 1000 2.79 116.45 37.74
> 8 1000 2.88 146.80 28.30
> 16 1000 2.89 174.36 31.96
> 32 1000 3.23 302.12 45.65
> 64 1000 3.12 408.95 93.14
> 128 1000 3.07 576.51 232.83
> 256 1000 3.10 923.28 738.29
> 512 1000 1090.20 1092.25 1091.05
> 1024 1000 1212.27 1217.47 1214.91
> 2048 1000 1300.73 1306.18 1302.84
> 4096 1000 1474.26 1476.34 1475.58
> 8192 1000 1935.20 1936.17 1935.86
> 16384 1000 2562.77 2563.99 2563.63
> 32768 1000 3874.37 3876.27 3875.87
> 65536 640 73380.90 73387.62 73385.88
> 131072 320 350385.89 350418.67 350407.98
> 262144 160 12655.13 12828.11 12671.15
> 524288 80 23142.21 23554.27 23348.91
> 1048576 40 41440.35 41724.03 41584.49
> 2097152 20 58607.35 59571.41 59087.99
> 4194304 10 94975.40 100692.30 97813.92
There is something very wrong with these figures: you should never get
a jump in performance like that, and a t_min of 2.79usec for an
80-process Reduce_scatter isn't possible. I'd recompile IMB with
result checking enabled and check for errors first.
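Failing that, a standalone sanity check is quick to knock up; this
sketch (sizes are arbitrary) has every rank contribute a vector of
ones, so each element of the chunk it gets back should equal the
number of processes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1024   /* elements per rank, arbitrary example */

int main(int argc, char **argv)
{
    int rank, nprocs, errors = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *counts = malloc(nprocs * sizeof(int));
    int *send   = malloc(nprocs * CHUNK * sizeof(int));
    int *recv   = malloc(CHUNK * sizeof(int));

    for (int i = 0; i < nprocs; i++)         counts[i] = CHUNK;
    for (int i = 0; i < nprocs * CHUNK; i++) send[i]   = 1;

    MPI_Reduce_scatter(send, recv, counts, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD);

    /* Every element of the result should be the process count. */
    for (int i = 0; i < CHUNK; i++)
        if (recv[i] != nprocs) errors++;

    if (errors)
        printf("rank %d: %d bad elements\n", rank, errors);

    free(counts); free(send); free(recv);
    MPI_Finalize();
    return 0;
}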
> #----------------------------------------------------------------
> # Benchmarking Bcast
> # #processes = 80
> #----------------------------------------------------------------
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> 0 1000 0.06 0.10 0.07
> 1 1000 2559.63 2642.39 2577.71
> 2 1000 2577.59 2660.47 2592.78
> 4 1000 2541.65 2624.19 2561.87
> 8 1000 2538.07 2579.05 2556.78
> 16 1000 2540.35 2581.47 2558.62
> 32 1000 2539.32 2580.52 2557.10
> 64 1000 2539.23 2580.50 2555.27
> 128 1000 2539.12 2581.62 2553.66
> 256 1000 2585.36 2627.14 2588.85
> 512 1000 141.55 142.39 141.99
> 1024 1000 208.98 210.17 209.78
> 2048 1000 227.81 228.95 228.45
> 4096 1000 293.99 318.13 306.15
> 8192 1000 486.10 487.18 486.73
> 16384 1000 799.65 801.21 800.74
> 32768 1000 1483.64 1486.06 1485.42
> 65536 640 3192.47 3199.19 3197.68
> 131072 320 6314.70 6341.48 6335.46
> 262144 160 12469.44 12554.68 12532.73
> 524288 80 9770.92 10413.19 10318.85
> 1048576 40 18792.78 20762.40 20533.59
> 2097152 20 33849.45 42141.25 41535.32
> 4194304 10 65966.61 81472.99 79850.54
Why is 512 bytes so much quicker than 8? The Reduce_scatter figures
showed the same issue, and it's something I'd want to look at.
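If I wanted to chase it, I'd time MPI_Bcast by hand at the suspicious
sizes outside of IMB; a rough sketch (sizes and repetition count chosen
arbitrarily, and without IMB's more careful synchronisation):

#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    int sizes[] = { 8, 256, 512, 1024 };   /* bytes, matching the odd rows */
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int s = 0; s < 4; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Bcast(buf, sizes[s], MPI_BYTE, 0, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("%4d bytes: %8.2f usec\n", sizes[s], t * 1e6);
    }

    MPI_Finalize();
    return 0;
}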
Ashley,