[Beowulf] itanium vs. x86-64
Michael Brown
spambox at emboss.co.nz
Tue Feb 10 12:52:04 PST 2009
I've got a zx2000 (1.5 GHz/6 MB Madison processor, 2 GB PC2100 RAM, general
system details at http://www.openpa.net/systems/hp_zx2000.html) that I use
for testing and benchmarking. Obviously there are some differences in performance
characteristics between this machine and a gazillion-processor Altix, but
it's usually not too far off. If there's any code you want tested feel free
to email me (replace spambox with michael if you think your email will upset
SpamAssassin). It's running Debian with ICC 10.1 20080801. It's also got GCC
4.1.2, but IME using GCC instead of ICC on IA64 results in somewhat reduced
performance, to say the least.
For a couple of reference points, here are some numbers against an Opteron
1210 machine (1.8 GHz/2x1 MB dual core, 2 GB PC2-6400) with GCC 4.3.2
running Solaris 10 (32-bit mode), and a Core 2 Q6600 (2.4 GHz/2x4 MB quad
core, 8 GB PC2-6400) with Visual Studio 2005 running Windows (64-bit mode).
Note that these are as much a test of compilers as a test of architectures.
I've spent a bit of effort tuning the compilers (and adjusting the code to
help the compilers), but someone who really knows their stuff can probably
get a bit more oomph out of them.
The first test is my Mersenne Twister-based PRNG library, which uses the
Ziggurat algorithm for Gaussian and exponential distributions. On x86, it's
accelerated using SSE (or MMX, if there's no SSE), though it falls back to
tweaked C++ code (many-register 64-bit, many-register 32-bit, and
register-constrained 32-bit variants, the last for things like x86). It's mostly integer code, though
things like the Gaussian tail distribution are transcendental-fp limited
(sqrt, log). The test involves filling a buffer of 1000 samples, with the
returned number being in processor cycles per sample. I've included SSE and
non-SSE numbers for the Opteron and Core 2 machines, and both 32-bit and
64-bit mode for the Core 2. The PRNG benchmark only uses a single thread, so it
gains no speedup on multicore machines.
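For concreteness, the timing loop amounts to something like the following. This
is a minimal sketch of the methodology, not the actual library code: the
generator here is a placeholder xorshift standing in for the MT/Ziggurat
buffer-fill routine, and the cycle reads are the usual TSC / ar.itc counters.

#include <cstdint>
#include <cstdio>
#include <cstddef>

#if defined(__ia64__)
// IA64: read the interval time counter (GCC inline-asm syntax).
static inline std::uint64_t read_cycles() {
    std::uint64_t t;
    asm volatile ("mov %0=ar.itc" : "=r"(t));
    return t;
}
#else
#include <x86intrin.h>
// x86/x86-64 (GCC/ICC): read the timestamp counter.  MSVC would use
// __rdtsc() from <intrin.h> instead.
static inline std::uint64_t read_cycles() { return __rdtsc(); }
#endif

// Placeholder generator so the sketch compiles; the real test calls the
// MT/Ziggurat library's buffer-fill routine here instead.
static void fill_uniform_u32(std::uint32_t *buf, std::size_t n) {
    static std::uint32_t s = 0x12345678u;
    for (std::size_t i = 0; i < n; ++i) {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;   // xorshift32
        buf[i] = s;
    }
}

int main() {
    const std::size_t N = 1000;            // 1000-sample buffer, as in the test
    static std::uint32_t buf[N];

    std::uint64_t best = ~0ull;
    for (int run = 0; run < 1000; ++run) { // keep the best run to cut timer noise
        std::uint64_t t0 = read_cycles();
        fill_uniform_u32(buf, N);
        std::uint64_t t1 = read_cycles();
        if (t1 - t0 < best) best = t1 - t0;
    }
    std::printf("%.2f cycles/sample\n", (double)best / N);
    return 0;
}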
Test                 Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                        SSE      No SSE    SSE      No SSE    SSE      No SSE
Uniform u32             3.72     5.02    24.20      3.07     4.51      3.14     9.85
Uniform fp32           14.90     5.11    27.12      2.89     6.39      3.02    12.65
Uniform fp64           11.43    10.39    70.72      5.71    11.20      5.81    36.55
Gaussian tail fp64    113.58   225.82   324.47    197.30   188.54    108.73   254.24
Gaussian fp64          34.30    49.73    96.43     26.47    24.54     25.80    58.81
Exponential fp32       28.59    29.50    47.58     17.22    15.18     19.71    27.64
Exponential fp64       45.04    53.55   101.23     29.02    25.07     30.46    59.44
Cycle for cycle, the Itanium more or less holds its ground against the
Opteron. Against the Core 2 in 64-bit mode with SSE, it gets thumped pretty
badly in everything except the Gaussian tail distribution. This is mainly due
to the sheer amount of integer grunt that's available on the Core 2 when you
fully use SSE2. Even the Opteron, executing pretty much exactly the same
code as the Core 2, can't keep up since the SSE integer units are only half
as wide there. I'd expect a quad-core (Barcelona) Opteron would have a much
better showing here.
If you take away the advantage of hand-optimised SSE, but stay in 64-bit
mode, the Itanium is a bit more competitive, though it still only manages to
beat the Core 2 in one distribution (uniform u32). Unfortunately, I can't
run the Opteron box in 64-bit mode, so no results there.
Finally, if you kick the x86 machines back into 32-bit mode and take away
the tuned SSE code, the shortage of registers takes the legs out from under
the compilers. The Itanium takes the crown in nearly all the tests, and GCC
on the Opteron simply implodes.
The second test is a Monte Carlo raytracer that tracks the path of ions
through a gas-filled solenoid, simulating interactions (scattering and
charge exchange) between the ion and the gas. At the core it's a 4th/5th-order
adaptive Runge-Kutta-Fehlberg integrator that does bilinear sampling for the
magnetic field, and uses the above PRNG library to sample for scattering and
charge exchange events. It's primarily fp-limited, since the working set
is very small. It is minimally SSE-accelerated, since both GCC and ICC make
a complete mess of the autovectorization and I haven't had time to go in and
do it all by hand. The main RKF calculations are not vectorized. It can also
do GPGPU acceleration using DX10, but I'm leaving that out here. Time is in
seconds for 1000 ions (with gas) or 200000 ions (without gas), rounded to
the nearest hundredth, and since it's Monte Carlo it obviously scales linearly
with the number of cores.
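For reference, the bilinear field lookup at the heart of the integrator boils
down to something like this. It's a sketch under an assumed data layout (a
regular r-z grid stored row-major); the struct and names are illustrative, not
the actual code.

#include <cstddef>

// Hypothetical field map: Bz sampled on a regular (r, z) grid, row-major.
struct FieldMap {
    const double *bz;     // nr * nz samples
    std::size_t nr, nz;   // grid dimensions
    double r0, z0;        // grid origin
    double dr, dz;        // grid spacing
};

// Bilinear interpolation of Bz at (r, z), assuming the point lies within the
// gridded region.  Each lookup is a little index arithmetic plus a handful of
// multiply-adds on a tiny working set, which is why the integrator ends up
// fp-limited rather than memory-limited.
double sample_bz(const FieldMap &f, double r, double z) {
    double fr = (r - f.r0) / f.dr;
    double fz = (z - f.z0) / f.dz;
    std::size_t ir = (std::size_t)fr;
    std::size_t iz = (std::size_t)fz;
    if (ir > f.nr - 2) ir = f.nr - 2;   // clamp to the grid interior
    if (iz > f.nz - 2) iz = f.nz - 2;
    double tr = fr - (double)ir;        // fractional offsets in [0, 1)
    double tz = fz - (double)iz;

    const double *row0 = f.bz + ir * f.nz + iz;
    const double *row1 = row0 + f.nz;
    double b0 = row0[0] + tz * (row0[1] - row0[0]);  // lerp along z, row ir
    double b1 = row1[0] + tz * (row1[1] - row1[0]);  // lerp along z, row ir+1
    return b0 + tr * (b1 - b0);                      // lerp along r
}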
Test            Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                   SSE      No SSE    SSE      No SSE    SSE      No SSE
With gas          23.50    11.52    12.32      3.25     3.40      5.77     4.67
Without gas       20.41    10.03    10.51      3.48     3.35      5.13     5.69
Obviously, with the most cores and the most raw clock speed, the Core 2
completely dominates. Scaling to a 1 GHz single core (i.e. multiplying each
time by the number of cores and the clock speed in GHz; for example, the
64-bit SSE Core 2 "with gas" run becomes 3.25 s x 4 cores x 2.4 GHz = 31.20)
gives a bit more of an idea of efficiency:
Test            Itanium   Opteron            Core 2 64-bit      Core 2 32-bit
Variant                   SSE      No SSE    SSE      No SSE    SSE      No SSE
With gas          35.25    41.47    44.35     31.20    32.64     55.39    44.83
Without gas       30.62    36.11    37.84     33.41    32.16     49.25    54.62
When MADDs are the bulk of the work (no gas, so basic RKF integration) the
Itanium comes out on top by a hair. The Core 2, or more likely the MSVC
compiler, really struggles in 32-bit mode, though it can come close to the
Itanium in 64-bit mode. SSE only has a minimal (in fact, negative for the
Core 2 in 64-bit mode) effect here since most of the code doesn't use it.
When the gas interactions are added, the Itanium drops back behind the
64-bit Core 2 but still comes in front of the Opteron in 32-bit mode. It
would have been interesting to see how the Opteron did in 64-bit mode.
So for these two (admittedly rather limited in scope) tests, the Itanium is
relatively competitive on a clock-for-clock basis with a Core 2 in 64-bit mode
in floating-point-dominated tests where the integration kernel hasn't been
vectorized using SSE. Once the workload becomes a bit more branchy and
integer-heavy, it drops behind slightly. In situations that have had a lot
of SSE tuning, though, such as the PRNG code, the Core 2 really dominates.
Of course, clock for clock doesn't help all that much when the top-end Core
2 is running about twice as fast as the top-end Itanium, and is much
cheaper. And this is the basic problem for the Itanium - the top speed bin
of the Itanium has only gone up 166 MHz (11%) since June 2003, and core IPC
hasn't gone up much either. All that's changed from a performance point of
view is that there's more cache, a faster bus, and more cores per socket.
This obviously has some benefit to more memory-hungry software, but you have
to wonder how well a Nehalem would do if you gave it a similar amount of
cache.
The main thing I've seen going for the Itanium in HPC is SGI's NUMALink. A
colleague of mine is developing some quantum mechanics simulation stuff, and
scaling on the ANU Altix is great. Scaling on a Woodcrest Xeon cluster using
Infiniband ... poor to the point of almost not worth going outside a single
node. Hopefully, with the Nehalem and Tukwila sharing the same socket we
might be able to get NUMALinked Nehalems, which would really throw a
curveball into the HPC interconnect market.
Cheers,
Michael