[Beowulf] Re: Performance characterising a HPC application
Ashley Pittman
ashley at quadrics.com
Wed Apr 4 08:27:56 PDT 2007
stephen mulcahy wrote:
> Hi,
>
> As a follow on to my previous mail, I've gone ahead and run the Intel
> MPI Benchmarks (v3.0) on this cluster and gotten the following results -
> I'd be curious to know how they compare to other similar clusters.
You don't say what hardware you are using, but one of your original
mails describes it as "a 20 node opteron 270 (2.0GHz dual core, 4GB ram,
diskless) cluster with gigabit ethernet interconnect".
You should be aware that the PingPong and PingPing tests in IMB ping
messages between processes 0 and 1 in the job; it's likely that these
are on the same node, so you are measuring shared-memory performance.
Given that it reports a bandwidth figure of 607MB/s I doubt very much
it's going over the ethernet. You should re-run these two tests using
one processor per node over two nodes to get the network speeds; you
can then use the ratio of nodes to CPUs in a typical job to decide how
much weight to attach to the different results.
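If it helps, a couple of lines of MPI will confirm where the launcher
is actually putting ranks 0 and 1; this is just a rough sketch, not
part of IMB:

/* placement.c - print which host each MPI rank is running on.
   Compile with your MPI's C wrapper, e.g. mpicc placement.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* If ranks 0 and 1 print the same hostname then PingPong/PingPing
       are exercising shared memory, not the gigabit network. */
    printf("rank %d running on %s\n", rank, host);

    MPI_Finalize();
    return 0;
}

Launch it the same way you launch IMB and the placement should match.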
None of the figures you have posted strike me as being that great; I've
commented on some of the individual ones below.
> Also, I'm trying to determine which parts of the IMB results are most
> important for me
It depends on the application and which MPI calls it uses. PingPong is
probably the most important.
> my understanding is that PingPong is a good measure
> of overall latency and bandwidth between individual nodes in the cluster.
Yes.
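For reference, the t[usec] and Mbytes/sec columns come from a loop
along these lines: half the round-trip time for a message, and the
message size divided by that time (with Mbytes meaning 2^20 bytes).
A minimal sketch, with the message size and repetition count picked
arbitrarily:

/* pingpong.c - minimal version of what IMB's PingPong measures.
   Run with exactly 2 ranks, ideally one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (4*1024*1024)   /* message size, arbitrary example */
#define REPS   100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    /* One-way time in seconds: half the average round trip. */
    double t = (MPI_Wtime() - t0) / REPS / 2.0;

    if (rank == 0)
        printf("%d bytes: %.2f usec, %.2f MBytes/sec\n",
               NBYTES, t * 1e6, (NBYTES / 1048576.0) / t);

    free(buf);
    MPI_Finalize();
    return 0;
}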
> Am I correct in thinking that Bcast and Reduce are good indicators of
> the performance of the cluster in terms of sending and receiving data
> from the head node to the compute nodes? My guess is that the other
> benchmarks are not as relevant to me since they measure performance for
> various types of inter-node traffic rather than the one-to-many pattern
> exhibited by my application.
It depends. If you are using one-to-many patterns in the app then code
it to use Bcast, creating the requisite sub-communicators as needed;
the Bcast benchmark is then worth looking at.
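For what it's worth, the skeleton is only a few lines; this is just a
sketch, with the choice of which ranks go in the sub-communicator (the
colour) made up for illustration:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double data[1024] = { 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Made-up rule: the even ranks form the one-to-many group. */
    int colour = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &sub);

    if (sub != MPI_COMM_NULL) {
        /* Rank 0 of the sub-communicator acts as the "head" root. */
        MPI_Bcast(data, 1024, MPI_DOUBLE, 0, sub);
        MPI_Comm_free(&sub);
    }

    MPI_Finalize();
    return 0;
}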
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
> #---------------------------------------------------
> #---------------------------------------------------
> # Benchmarking PingPong
> # #processes = 2
> # ( 78 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
> #bytes #repetitions t[usec] Mbytes/sec
> 0 1000 27.63 0.00
> 1 1000 28.56 0.03
> 2 1000 28.65 0.07
> 4 1000 28.72 0.13
> 8 1000 27.03 0.28
> 16 1000 28.63 0.53
> 32 1000 28.79 1.06
> 64 1000 28.72 2.13
> 128 1000 28.61 4.27
> 256 1000 28.75 8.49
> 512 1000 27.90 17.50
> 1024 1000 27.55 35.45
> 2048 1000 29.70 65.77
> 4096 1000 86.92 44.94
> 8192 1000 87.85 88.93
> 16384 1000 91.98 169.88
> 32768 1000 105.01 297.60
> 65536 640 149.88 417.00
> 131072 320 312.52 399.98
> 262144 160 547.92 456.27
> 524288 80 998.77 500.62
> 1048576 40 2008.35 497.92
> 2097152 20 3407.78 586.89
> 4194304 10 6583.70 607.56
*Assuming* this is for shared memory, these are pretty dreadful
figures. For reference, we get the following when we run the test
intra-node; as you can see, they are strikingly different.
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.05 0.00
1 1000 1.03 0.92
2 1000 1.03 1.86
4 1000 1.09 3.49
8 1000 1.05 7.28
16 1000 1.03 14.83
32 1000 1.05 29.08
64 1000 1.16 52.61
128 1000 1.27 96.34
256 1000 1.44 169.75
512 1000 1.76 278.16
1024 1000 2.36 414.15
2048 1000 3.51 556.56
4096 1000 5.25 744.18
8192 1000 5.89 1326.33
16384 1000 7.85 1989.66
32768 1000 12.77 2447.58
65536 640 23.85 2620.20
131072 320 44.14 2831.98
262144 160 85.17 2935.16
524288 80 359.69 1390.10
1048576 40 1207.04 828.47
2097152 20 2407.75 830.65
4194304 10 4792.27 834.68
> #---------------------------------------------------
> # Benchmarking PingPing
> # #processes = 2
> # ( 78 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
As above really.
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter
> # #processes = 80
> #----------------------------------------------------------------
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> 0 1000 1.15 1.17 1.15
> 4 1000 2.79 116.45 37.74
> 8 1000 2.88 146.80 28.30
> 16 1000 2.89 174.36 31.96
> 32 1000 3.23 302.12 45.65
> 64 1000 3.12 408.95 93.14
> 128 1000 3.07 576.51 232.83
> 256 1000 3.10 923.28 738.29
> 512 1000 1090.20 1092.25 1091.05
> 1024 1000 1212.27 1217.47 1214.91
> 2048 1000 1300.73 1306.18 1302.84
> 4096 1000 1474.26 1476.34 1475.58
> 8192 1000 1935.20 1936.17 1935.86
> 16384 1000 2562.77 2563.99 2563.63
> 32768 1000 3874.37 3876.27 3875.87
> 65536 640 73380.90 73387.62 73385.88
> 131072 320 350385.89 350418.67 350407.98
> 262144 160 12655.13 12828.11 12671.15
> 524288 80 23142.21 23554.27 23348.91
> 1048576 40 41440.35 41724.03 41584.49
> 2097152 20 58607.35 59571.41 59087.99
> 4194304 10 94975.40 100692.30 97813.92
There is something very wrong with these figures: you should never get
a jump in performance like that, and a t_min of 2.79usec for an
80-process Reduce_scatter isn't possible. I'd recompile IMB with
result checking enabled and check for errors first.
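Failing that, a standalone sanity check is quick to knock up; this
sketch (sizes are arbitrary) has every rank contribute a vector of
ones, so each element of the chunk it gets back should equal the
number of processes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1024   /* elements per rank, arbitrary example */

int main(int argc, char **argv)
{
    int rank, nprocs, errors = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *counts = malloc(nprocs * sizeof(int));
    int *send   = malloc(nprocs * CHUNK * sizeof(int));
    int *recv   = malloc(CHUNK * sizeof(int));

    for (int i = 0; i < nprocs; i++)         counts[i] = CHUNK;
    for (int i = 0; i < nprocs * CHUNK; i++) send[i]   = 1;

    MPI_Reduce_scatter(send, recv, counts, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD);

    /* Every element of the result should be the process count. */
    for (int i = 0; i < CHUNK; i++)
        if (recv[i] != nprocs) errors++;

    if (errors)
        printf("rank %d: %d bad elements\n", rank, errors);

    free(counts); free(send); free(recv);
    MPI_Finalize();
    return 0;
}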
> #----------------------------------------------------------------
> # Benchmarking Bcast
> # #processes = 80
> #----------------------------------------------------------------
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> 0 1000 0.06 0.10 0.07
> 1 1000 2559.63 2642.39 2577.71
> 2 1000 2577.59 2660.47 2592.78
> 4 1000 2541.65 2624.19 2561.87
> 8 1000 2538.07 2579.05 2556.78
> 16 1000 2540.35 2581.47 2558.62
> 32 1000 2539.32 2580.52 2557.10
> 64 1000 2539.23 2580.50 2555.27
> 128 1000 2539.12 2581.62 2553.66
> 256 1000 2585.36 2627.14 2588.85
> 512 1000 141.55 142.39 141.99
> 1024 1000 208.98 210.17 209.78
> 2048 1000 227.81 228.95 228.45
> 4096 1000 293.99 318.13 306.15
> 8192 1000 486.10 487.18 486.73
> 16384 1000 799.65 801.21 800.74
> 32768 1000 1483.64 1486.06 1485.42
> 65536 640 3192.47 3199.19 3197.68
> 131072 320 6314.70 6341.48 6335.46
> 262144 160 12469.44 12554.68 12532.73
> 524288 80 9770.92 10413.19 10318.85
> 1048576 40 18792.78 20762.40 20533.59
> 2097152 20 33849.45 42141.25 41535.32
> 4194304 10 65966.61 81472.99 79850.54
Why is 512 bytes so much quicker than 8? The Reduce_scatter figures
showed the same issue, and it's something I'd want to look at.
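If I wanted to chase it, I'd time MPI_Bcast by hand at the suspicious
sizes outside of IMB; a rough sketch (sizes and repetition count chosen
arbitrarily, and without IMB's more careful synchronisation):

#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    int sizes[] = { 8, 256, 512, 1024 };   /* bytes, matching the odd rows */
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int s = 0; s < 4; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Bcast(buf, sizes[s], MPI_BYTE, 0, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("%4d bytes: %8.2f usec\n", sizes[s], t * 1e6);
    }

    MPI_Finalize();
    return 0;
}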
Ashley,