[Beowulf] New HPCC results, and an MX question

Tue Jul 19 17:55:55 PDT 2005

First off, I'd like to announce that we've started publishing public
benchmark data for InfiniPath; for example, we've now got a data point
listed at the HPC Challenge website:

http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

In particular I'd like to point out our "Random Ring Latency" number
of 1.31 usec. This benchmark is a lot more realistic than the usual
ping-pong latency, because it uses all the cpus on all the nodes,
instead of just 1 cpu on each of 2 nodes. If you examine other
interconnects, you'll note that many of them get a much worse random
ring latency than ordinary ping-pong.
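
To illustrate why a random ring is harder on the interconnect than a
2-process ping-pong, here is a minimal sketch. It is not the HPCC
source. It assumes the ring is a random permutation of all ranks, that
each rank talks to its left and right neighbours, and that ranks are
placed block-wise on nodes (rank // ranks_per_node gives the node).
The function names and the 32-rank, 2-ranks-per-node layout are made
up for the example.

```python
import random


def random_ring_neighbours(n_ranks, seed=0):
    """Return {rank: (left, right)} for a randomly permuted ring of ranks."""
    order = list(range(n_ranks))
    random.Random(seed).shuffle(order)
    neighbours = {}
    for pos, rank in enumerate(order):
        left = order[(pos - 1) % n_ranks]
        right = order[(pos + 1) % n_ranks]
        neighbours[rank] = (left, right)
    return neighbours


def off_node_fraction(neighbours, ranks_per_node):
    """Fraction of ring links whose two ends sit on different nodes.

    Assumption: block placement, i.e. node = rank // ranks_per_node.
    """
    links = [(rank, right) for rank, (_, right) in neighbours.items()]
    off = sum(1 for a, b in links
              if a // ranks_per_node != b // ranks_per_node)
    return off / len(links)


# Hypothetical layout: 32 ranks on dual-cpu nodes.
ring = random_ring_neighbours(32)
print(off_node_fraction(ring, ranks_per_node=2))
```

Every rank is busy at once, and with a random ordering most links
cross the network rather than staying inside a node.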

Second, I have a question about Myrinet MX performance. Myricom has
better things to do than answer my performance queries (no surprise;
every company prefers to answer customer queries first). With GM,
Myricom published the raw output from the Pallas benchmark, and that
was very useful for doing comparisons. With MX, Myricom hasn't
published the raw data, but they did publish graphs. The claimed
0-byte latency is 2.6 usec, with no explanation of what benchmark was
used. The graph at:

http://www.myri.com/myrinet/performance/MPICH-MX/

for Pallas pingpong latency uses a log/log scale, so it's hard to see
what latency it got without the detailed results, which are not
provided. The bandwidth chart, however, is semi-log, and at 32-byte
payloads the bandwidth looks to me like 9 or 10 MB/s. That corresponds
to a 3.1 to 3.4 usec 0-byte latency. The bandwidths for 64 bytes and
128 bytes seem to support this number, too.
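
The arithmetic behind that estimate can be sketched as follows. The
assumption here is that the chart uses the same convention as the
Pallas text output, where 1 Mbyte is 2**20 bytes; the PingPong numbers
at the end of this message are consistent with that. The function
names are made up for the example. Note that this gives the time for
a 32-byte message, which only approximates the 0-byte latency.

```python
MBYTE = 2 ** 20  # assumption: Pallas-style Mbytes/sec, 1 Mbyte = 2**20 bytes


def latency_usec(payload_bytes, mbytes_per_sec):
    """One-way time in usec implied by a pingpong bandwidth reading."""
    return payload_bytes / (mbytes_per_sec * MBYTE) * 1e6


def bandwidth_mbytes(payload_bytes, t_usec):
    """Inverse: Mbytes/sec implied by a one-way time in usec."""
    return payload_bytes / (t_usec * 1e-6) / MBYTE


# Bandwidth read off the chart at 32 bytes: somewhere between 9 and 10 MB/s.
for bw in (9.0, 10.0):
    print(f"{bw:.0f} MB/s at 32 bytes -> {latency_usec(32, bw):.2f} usec")

# Sanity check of the unit convention against the last row of the
# InfiniPath table in this message (4194304 bytes in 4452.19 usec).
print(f"{bandwidth_mbytes(4194304, 4452.19):.2f} Mbytes/sec")
```

If the chart actually used 10**6 bytes per MB, the same readings would
come out a few percent higher.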

So, the question is, am I full of it? Wait, don't answer that! The
question is, can someone using MX please run Pallas pingpong and
publish the raw output?

To be fair, we don't have these details for InfiniPath up on our
website yet, so here's what we get on our 2.6 GHz dual-cpu systems.
We're about 30 nanoseconds slower on this pingpong than the number
we get from the osu_latency pingpong.

-- greg

#---------------------------------------------------
# Benchmarking PingPong 
# ( #processes = 2 ) 
# ( 30 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.35         0.00
            1         1000         1.36         0.70
            2         1000         1.36         1.41
            4         1000         1.34         2.85
            8         1000         1.35         5.66
           16         1000         1.59         9.58
           32         1000         1.63        18.75
           64         1000         1.68        36.38
          128         1000         1.79        68.20
          256         1000         2.04       119.47
          512         1000         2.53       192.73
         1024         1000         3.51       277.86
         2048         1000         5.57       350.71
         4096         1000         7.46       523.45
         8192         1000        11.70       668.02
        16384         1000        21.49       727.14
        32768         1000        42.89       728.55
        65536          640        88.76       704.17
       131072          320       161.42       774.36
       262144          160       308.38       810.68
       524288           80       582.13       858.92
      1048576           40      1146.71       872.06
      2097152           20      2253.23       887.62
      4194304           10      4452.19       898.43