[Beowulf] EM64T Clusters

Wed Jul 28 21:12:39 PDT 2004

On Wed, 28 Jul 2004, Bill Broadley wrote:

> > We've just brought up a test stand with both PCI-X and PCI-E Infiniband
> > host channel adapters.  Some very preliminary (and sketchy, sorry) test
> > results which will be updated occasionally are available at:
> >
> >    http://lqcd.fnal.gov/benchmarks/newib/
>
> Interesting, the listed:
>     * PCI Express: 4.5 microsec
>     * PCI-X, "HPC Gold": 7.4 microsec
>     * PCI-X, Topspin v2.0.0_531: 7.3 microsec
>
> Seem kind of slow to me; I suspect it's mostly the nodes (not PCI-X).

I suspect that you're right.  Usually I've heard that I should see only
a 1 microsecond improvement in moving from PCI-X to PCI-Express.  The
numbers I'm reporting for the E7500 implementation of PCI-X are
consistent with what I measured last September on an E7501 cluster using
an older Netpipe (version 2.3) - see http://lqcd.fnal.gov/ib/.  My data
files from those runs show 7 microseconds, reported by Netpipe only to
that precision.  E7500/E7501 is getting pretty old, I suppose - these
nodes are 3 years old now, and if I recall correctly E7500 was the first
PCI-X chipset from Intel (i860 was just PCI 64/66, maybe?). E7500 was
also the first well-performing PCI bus from Intel after the terrible
PCI bandwidths on i840/i850/i860.

> I'm using dual Opterons, PCI-X, and "HPC Gold", and getting 4.7 us/hop
> (0.62 seconds for 131072 hops):
>
> compute-0-0.local compute-0-1.local
> size=    1, 131072 hops, 2 nodes in  0.62 sec (  4.7 us/hop)    826 KB/sec
>
> My benchmark just does an MPI_Send<->MPI_Recv of a single integer,
> increments the integer, and passes it along a circularly linked list
> of nodes.  What exact command-line arguments did you use with netpipe?
> I'd like to compare results.
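
As a sanity check on the figures in the quoted benchmark line, here is a
minimal sketch that recomputes the per-hop latency and the throughput from
the hop count and elapsed time.  It assumes the integer is 4 bytes and that
"KB" means 1024 bytes; neither is stated in the output, but with those
assumptions the arithmetic matches the reported 4.7 us/hop and 826 KB/sec.

```python
# Values copied from the quoted benchmark output line.
hops = 131072
elapsed_sec = 0.62  # printed with two decimals, so this is a rounded value

# Assumptions (not stated in the original output):
# the integer passed around is 4 bytes, and "KB" means 1024 bytes.
int_bytes = 4
kb = 1024

# Average time per hop, in microseconds.
us_per_hop = elapsed_sec / hops * 1e6

# Payload throughput: total bytes sent divided by total elapsed time.
kb_per_sec = hops * int_bytes / elapsed_sec / kb

print(f"{us_per_hop:.1f} us/hop")
print(f"{kb_per_sec:.0f} KB/sec")
```

Because the elapsed time is only printed to two decimals, the recomputed
values are approximate, but they round to the same figures as the output.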

I've added the commands used for each of the Netpipe runs shown in the
graphs to the web page (http://lqcd.fnal.gov/benchmarks/newib/).  All of
these runs are vanilla (no additional switches), except I suppose for
the "-t rdma_write" on the "verbs" run where bandwidth is greatly
improved versus the default.  I have the results of many other switch
combinations as well, but I haven't had a chance to digest them yet.

>
> > The PCI Express nodes are based on Abit AA8 motherboards, which have x16
> > slots.  We used the OpenIB drivers, as supplied by Mellanox in their
> > "HPC Gold" package, with Mellanox Infinihost III Ex HCAs.
> >
> > The PCI-X nodes are a bit dated, but still capable.  They are based on
> > SuperMicro P4DPE motherboards, which use the E7500 chipset.  We used
> > Topspin HCAs on these systems, with either the supplied drivers or the
> > OpenIB drivers.
> >
> > I've posted NetPipe graphs (MPI, rdma, and IPoIB) and Pallas MPI
> > benchmark results.  MPI latencies for the PCI Express systems were about
>
> Are the raw results for your netpipe runs available?

Yes.  I've added links to the raw results to the web page.

>
> > 4.5 microseconds; for the PCI-X systems, the figure was 7.3
> > microseconds.  With Pallas, sendrecv() bandwidths peaked at
> > approximately 1120 MB/sec on the PCI Express nodes, and about 620 MB/sec
>
> My pci-x nodes do about midway between those numbers:
> # Benchmarking Sendrecv
> #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>  524288           80      1249.87      1374.87      1312.37       727.34
> 1048576           40      2499.78      2499.78      2499.78       800.07
> 2097152           20      4999.55      5499.45      5249.50       727.35
>
>
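
The Mbytes/sec column in the quoted Sendrecv table can be recomputed from the
message size and t_max.  The sketch below assumes the benchmark counts
2 x message size per iteration (one send plus one receive) and that
1 Mbyte = 2^20 bytes.  That formula is inferred from the numbers in the
table, not taken from the benchmark's documentation.

```python
# (message size in bytes, t_max in microseconds, reported Mbytes/sec),
# copied from the quoted Pallas Sendrecv output.
rows = [
    (524288, 1374.87, 727.34),
    (1048576, 2499.78, 800.07),
    (2097152, 5499.45, 727.35),
]


def sendrecv_mbytes_per_sec(nbytes, t_max_usec):
    """Bandwidth under the assumptions stated above.

    Assumes 2 * nbytes are moved per iteration (send + receive),
    1 Mbyte = 2**20 bytes, and that t_max is the timing used.
    """
    return 2 * nbytes / 2**20 / (t_max_usec * 1e-6)


computed = [sendrecv_mbytes_per_sec(n, t) for n, t, _ in rows]

for (nbytes, t_max, reported), value in zip(rows, computed):
    print(f"{nbytes:>8} bytes: computed {value:.2f}, reported {reported:.2f}")
```

Under these assumptions the computed values agree with the reported column
to two decimal places, which suggests the bandwidth is derived from t_max
rather than t_avg.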
> > I don't have benchmarks for our application posted yet but will do so
> > once we add another pair of PCI-E nodes.
>
> I have 10 PCI-X dual opterons and should have 16 real soon if you want
> to compare Infiniband+pci-x on nodes that are closer to your pci-express
> nodes.

Yes, I would be very interested in lattice QCD application benchmarks on
your dual Opterons.  I should have access next week to about 16 dual
Xeon PCI Express nodes with Infiniband - the comparison should be very
enlightening.  Are you using libnuma?

Don Holmgren
Fermilab