[Beowulf] EM64T Clusters
djholm at fnal.gov
Wed Jul 28 21:12:39 PDT 2004
On Wed, 28 Jul 2004, Bill Broadley wrote:
> > We've just brought up a test stand with both PCI-X and PCI-E Infiniband
> > host channel adapters. Some very preliminary (and sketchy, sorry) test
> > results which will be updated occasionally are available at:
> > http://lqcd.fnal.gov/benchmarks/newib/
> Interesting, the listed:
> * PCI Express: 4.5 microsec
> * PCI-X, "HPC Gold": 7.4 microsec
> * PCI-X, Topspin v2.0.0_531: 7.3 microsec
> Seem kind of slow to me, I suspect it's mostly the nodes (not pci-x).
I suspect that you're right. From what I've heard, moving from PCI-X to
PCI Express should yield only about a 1 microsecond improvement. The
numbers I'm reporting for the E7500 implementation of PCI-X are
consistent with what I measured last September on an E7501 cluster using
an older Netpipe (version 2.3) - see http://lqcd.fnal.gov/ib/. My data
files from those runs show 7 microseconds, reported by Netpipe only to
that precision. E7500/E7501 is getting pretty old, I suppose - these
nodes are 3 years old now, and if I recall correctly E7500 was the first
PCI-X chipset from Intel (i860 was just PCI 64/66, maybe?). E7500 was
also the first well-performing PCI bus from Intel after the terrible
PCI bandwidths on i840/i850/i860.
> I'm using dual opterons, PCI-X, and "HPC Gold" and getting 0.62 seconds:
> compute-0-0.local compute-0-1.local
> size= 1, 131072 hops, 2 nodes in 0.62 sec ( 4.7 us/hop) 826 KB/sec
> My benchmark just does a MPI_Send<->MPI_Recv of a single integer,
> increments the integer, and passes it along in a circularly linked list
> of nodes. What exact command line arguments did you use with netpipe?
> I'd like to compare results.
I've added the commands used for each of the Netpipe runs shown in the
graphs to the web page (http://lqcd.fnal.gov/benchmarks/newib/). All of
these runs are vanilla (no additional switches), except I suppose for
the "-t rdma_write" on the "verbs" run, where bandwidth is greatly
improved versus the default. I have the results of many other switch
combinations as well, but I haven't had a chance to digest them yet.
> > The PCI Express nodes are based on Abit AA8 motherboards, which have x16
> > slots. We used the OpenIB drivers, as supplied by Mellanox in their
> > "HPC Gold" package, with Mellanox Infinihost III Ex HCA's.
> > The PCI-X nodes are a bit dated, but still capable. They are based on
> > SuperMicro P4DPE motherboards, which use the E7500 chipset. We used
> > Topspin HCA's on these systems, with either the supplied drivers or the
> > OpenIB drivers.
> > I've posted NetPipe graphs (MPI, rdma, and IPoIB) and Pallas MPI
> > benchmark results. MPI latencies for the PCI Express systems were about
> Are the raw results for your netpipe runs available?
Yes. I've added links to the raw results to the web page.
> > 4.5 microseconds; for the PCI-X systems, the figure was 7.3
> > microseconds. With Pallas, sendrecv() bandwidths peaked at
> > approximately 1120 MB/sec on the PCI Express nodes, and about 620 MB/sec
> My pci-x nodes fall about midway between those numbers:
> # Benchmarking Sendrecv
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
> 524288 80 1249.87 1374.87 1312.37 727.34
> 1048576 40 2499.78 2499.78 2499.78 800.07
> 2097152 20 4999.55 5499.45 5249.50 727.35
> > I don't have benchmarks for our application posted yet but will do so
> > once we add another pair of PCI-E nodes.
> I have 10 PCI-X dual opterons and should have 16 real soon if you want
> to compare Infiniband+pci-x on nodes that are closer to your pci-express
Yes, I would be very interested in lattice QCD application benchmarks on
your dual Opterons. I should have access next week to about 16 dual
Xeon PCI Express nodes with Infiniband - the comparison should be very
enlightening. Are you using libnuma?