[Beowulf] itanium vs. x86-64
Michael Brown
spambox at emboss.co.nz
Tue Feb 10 12:52:04 PST 2009
I've got a zx2000 (1.5 GHz/6 MB Madison processor, 2 GB PC2100 RAM, general
system details at http://www.openpa.net/systems/hp_zx2000.html) that I use
for testing and benchmarking. Obviously there are some differences in performance
characteristics between this machine and a gazillion-processor Altix, but
it's usually not too far off. If there's any code you want tested feel free
to email me (replace spambox with michael if you think your email will upset
SpamAssassin). It's running Debian with ICC 10.1 20080801. It's also got GCC
4.1.2, but IME using GCC instead of ICC on IA64 results in somewhat reduced
performance, to say the least.
For a couple of reference points, here are some numbers against an Opteron
1210 machine (1.8 GHz/2x1 MB dual core, 2 GB PC2-6400) with GCC 4.3.2
running Solaris 10 (32-bit mode), and a Core 2 Q6600 (2.4 GHz/2x4 MB quad
core, 8 GB PC2-6400) with Visual Studio 2005 running Windows (64-bit mode).
Note that these are as much a test of compilers as a test of architectures.
I've spent a bit of effort tuning the compilers (and adjusting the code to
help the compilers), but someone who really knows their stuff can probably
get a bit more oomph out of them.
The first test is my Mersenne Twister-based PRNG library, which uses the
Ziggurat algorithm for Gaussian and exponential distributions. On x86, it's
accelerated using SSE (or MMX, if there's no SSE), though it falls back to
tweaked C++ code (many-register 64-bit, many-register 32-bit, and
register-constrained 32-bit variants, the last for things like x86). It's mostly integer code, though
things like the Gaussian tail distribution are transcendental-fp limited
(sqrt, log). The test involves filling a buffer of 1000 samples, with the
returned number being in processor cycles per sample. I've included SSE and
non-SSE numbers for the Opteron and Core 2 machines, and both 32-bit and
64-bit mode for the Core 2. The PRNG benchmark only uses a single thread, so it
gains no speedup on multicore machines.
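For concreteness, the timing loop amounts to something like the following. This
is a minimal sketch of the methodology, not the actual library code: the
generator here is a placeholder xorshift standing in for the MT/Ziggurat
buffer-fill routine, and the cycle reads are the usual TSC / ar.itc counters.

#include <cstdint>
#include <cstdio>
#include <cstddef>

#if defined(__ia64__)
// IA64: read the interval time counter (GCC inline-asm syntax).
static inline std::uint64_t read_cycles() {
    std::uint64_t t;
    asm volatile ("mov %0=ar.itc" : "=r"(t));
    return t;
}
#else
#include <x86intrin.h>
// x86/x86-64 (GCC/ICC): read the timestamp counter.  MSVC would use
// __rdtsc() from <intrin.h> instead.
static inline std::uint64_t read_cycles() { return __rdtsc(); }
#endif

// Placeholder generator so the sketch compiles; the real test calls the
// MT/Ziggurat library's buffer-fill routine here instead.
static void fill_uniform_u32(std::uint32_t *buf, std::size_t n) {
    static std::uint32_t s = 0x12345678u;
    for (std::size_t i = 0; i < n; ++i) {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;   // xorshift32
        buf[i] = s;
    }
}

int main() {
    const std::size_t N = 1000;            // 1000-sample buffer, as in the test
    static std::uint32_t buf[N];

    std::uint64_t best = ~0ull;
    for (int run = 0; run < 1000; ++run) { // keep the best run to cut timer noise
        std::uint64_t t0 = read_cycles();
        fill_uniform_u32(buf, N);
        std::uint64_t t1 = read_cycles();
        if (t1 - t0 < best) best = t1 - t0;
    }
    std::printf("%.2f cycles/sample\n", (double)best / N);
    return 0;
}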
Test                 Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                        SSE      No SSE    SSE      No SSE    SSE      No SSE
Uniform u32             3.72     5.02    24.20      3.07     4.51      3.14     9.85
Uniform fp32           14.90     5.11    27.12      2.89     6.39      3.02    12.65
Uniform fp64           11.43    10.39    70.72      5.71    11.20      5.81    36.55
Gaussian tail fp64    113.58   225.82   324.47    197.30   188.54    108.73   254.24
Gaussian fp64          34.30    49.73    96.43     26.47    24.54     25.80    58.81
Exponential fp32       28.59    29.50    47.58     17.22    15.18     19.71    27.64
Exponential fp64       45.04    53.55   101.23     29.02    25.07     30.46    59.44
Cycle for cycle, the Itanium more or less holds its ground against the
Opteron. Against the Core 2 in 64-bit mode with SSE, it gets thumped pretty
badly in everything except the Gaussian tail distribution. This is mainly due
to the sheer amount of integer grunt that's available on the Core 2 when you
fully use SSE2. Even the Opteron, executing pretty much exactly the same
code as the Core 2, can't keep up since the SSE integer units are only half
as wide there. I'd expect a quad-core (Barcelona) Opteron would have a much
better showing here.
If you take away the advantage of hand-optimised SSE, but stay in 64-bit
mode, the Itanium is a bit more competitive, though it still only manages to
beat the Core 2 in one distribution (uniform u32). Unfortunately, I can't
run the Opteron box in 64-bit mode, so no results there.
Finally, if you kick the x86 machines back into 32-bit mode and take away
the tuned SSE code, the shortage of registers takes the legs out from under
the compilers. The Itanium takes the crown in nearly all the tests, and GCC
on the Opteron simply implodes.
The second test is a Monte Carlo raytracer that tracks the path of ions
through a gas-filled solenoid, simulating interactions (scattering and
charge exchange) between the ion and the gas. At the core it's a 4th/5th-order
adaptive Runge-Kutta-Fehlberg integrator that does bilinear sampling for the
magnetic field, and uses the above PRNG library to sample for scattering and
charge exchange events. It's primarily fp-limited, since the working set
is very small. It is minimally SSE-accelerated, since both GCC and ICC make
a complete mess of the autovectorization and I haven't had time to go in and
do it all by hand. The main RKF calculations are not vectorized. It can also
do GPGPU acceleration using DX10, but I'm leaving that out here. Time is in
seconds for 1000 ions (with gas) or 200000 ions (without gas), rounded to
the nearest hundredth, and since it's Monte Carlo it obviously scales linearly
with the number of cores.
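For reference, the bilinear field lookup at the heart of the integrator boils
down to something like this. It's a sketch under an assumed data layout (a
regular r-z grid stored row-major); the struct and names are illustrative, not
the actual code.

#include <cstddef>

// Hypothetical field map: Bz sampled on a regular (r, z) grid, row-major.
struct FieldMap {
    const double *bz;     // nr * nz samples
    std::size_t nr, nz;   // grid dimensions
    double r0, z0;        // grid origin
    double dr, dz;        // grid spacing
};

// Bilinear interpolation of Bz at (r, z), assuming the point lies within the
// gridded region.  Each lookup is a little index arithmetic plus a handful of
// multiply-adds on a tiny working set, which is why the integrator ends up
// fp-limited rather than memory-limited.
double sample_bz(const FieldMap &f, double r, double z) {
    double fr = (r - f.r0) / f.dr;
    double fz = (z - f.z0) / f.dz;
    std::size_t ir = (std::size_t)fr;
    std::size_t iz = (std::size_t)fz;
    if (ir > f.nr - 2) ir = f.nr - 2;   // clamp to the grid interior
    if (iz > f.nz - 2) iz = f.nz - 2;
    double tr = fr - (double)ir;        // fractional offsets in [0, 1)
    double tz = fz - (double)iz;

    const double *row0 = f.bz + ir * f.nz + iz;
    const double *row1 = row0 + f.nz;
    double b0 = row0[0] + tz * (row0[1] - row0[0]);  // lerp along z, row ir
    double b1 = row1[0] + tz * (row1[1] - row1[0]);  // lerp along z, row ir+1
    return b0 + tr * (b1 - b0);                      // lerp along r
}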
Test            Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                   SSE      No SSE    SSE      No SSE    SSE      No SSE
With gas          23.50    11.52    12.32      3.25     3.40      5.77     4.67
Without gas       20.41    10.03    10.51      3.48     3.35      5.13     5.69
Obviously, with the most cores and the most raw clock speed, the Core 2
completely dominates. Scaling to a 1 GHz single core (i.e. multiplying each
time by the number of cores and the clock speed in GHz; for example, the
64-bit SSE Core 2 "with gas" run becomes 3.25 s x 4 cores x 2.4 GHz = 31.20)
gives a bit more of an idea of efficiency:
Test            Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                   SSE      No SSE    SSE      No SSE    SSE      No SSE
With gas          35.25    41.47    44.35     31.20    32.64     55.39    44.83
Without gas       30.62    36.11    37.84     33.41    32.16     49.25    54.62
When MADDs are the bulk of the work (no gas, so basic RKF integration) the
Itanium comes out on top by a hair. The Core 2, or more likely the MSVC
compiler, really struggles in 32-bit mode, though it can come close to the
Itanium in 64-bit mode. SSE only has a minimal (in fact, negative for the
Core 2 in 64-bit mode) effect here since most of the code doesn't use it.
When the gas interactions are added, the Itanium drops back behind the
64-bit Core 2 but still comes in front of the Opteron in 32-bit mode. It
would have been interesting to see how the Opteron did in 64-bit mode.
So for these two (admittedly rather limited in scope) tests, the Itanium is
relatively competitive on a clock-for-clock basis with a Core 2 in 64-bit mode
in floating-point-dominated tests where the integration kernel hasn't been
vectorized using SSE. Once the workload becomes a bit more branchy and
integer-heavy, it drops behind slightly. In situations that have had a lot
of SSE tuning, though, such as the PRNG code, the Core 2 really dominates.
Of course, clock for clock doesn't help all that much when the top-end Core
2 is running about twice as fast as the top-end Itanium, and is much
cheaper. And this is the basic problem for the Itanium - the top speed bin
of the Itanium has only gone up 166 MHz (11%) since June 2003, and core IPC
hasn't gone up much either. All that's changed from a performance point of
view is that there's more cache, a faster bus, and more cores per socket.
This obviously has some benefit to more memory-hungry software, but you have
to wonder how well a Nehalem would do if you gave it a similar amount of
cache.
The main thing I've seen going for the Itanium in HPC is SGI's NUMALink. A
colleague of mine is developing some quantum mechanics simulation stuff, and
scaling on the ANU Altix is great. Scaling on a Woodcrest Xeon cluster using
Infiniband ... poor to the point of almost not worth going outside a single
node. Hopefully, with the Nehalem and Tukwila sharing the same socket we
might be able to get NUMALinked Nehalems, which would really throw a
curveball into the HPC interconnect market.
Cheers,
Michael