[Beowulf] Itanium2

Vincent Diepeveen diep at xs4all.nl
Thu Sep 15 17:24:26 PDT 2005

At 04:19 PM 9/12/2005 -0400, Mark Hahn wrote:
>> >Google for "superlinear speedup".  Most likely, as you split up your
>> >fixed problem size among more processors, more and more of it fits
>> >into the processor cache, where it runs much faster due to fewer main
>> >memory accesses.
>also google for "strong scaling" and contrast to "weak scaling".
>the former assumes a fixed problem size and a range of ncpus;
>the latter assumes a fixed problem *per* cpu.  I suspect you'll have 
>a hard time showing superlinear speedup under weak scaling ;)
>> This cache effect is quite profound on Altix since some of these have 
>> something like 9 MB cache per processor. You can see this result on 
>that's the irony: the it2 really works well when data is all in-cache,

I feel that the L1 and L2 of the I2 are very strong for what they are doing.
The Montecito improvements make sense, except that they are a year
late with that Montecito cpu, and I'll have to see at which speed it is

Of course Montecito should have 8 cores or so to really kick butt for the
price it will cost.

I2's real problem is that its IPC is too low for integer work. Effectively
there seem to be limits to really getting above an IPC of 2.0.

For just raw gflops, of course, competing with special dedicated low
clocked and ultra cheap boxes full of cheapo gflop cpus will be difficult
for a mega big cpu like I2; it's too big and eats too much power to ever
compete there real bigtime.

In such a case the struggle should be for a cpu that's not only fast at
floating point but above all outperforms any other cpu, single cpu against
single cpu, at integer workloads.

It seems they correct a lot with Montecito, from what I hear about its
design: just like the Opteron, it can execute big programs faster thanks to
more instruction cache being available to quickly feed the extremely tiny L1.
Right now, if I am correct, that all has to come from the L3 cache of
Itanium2, which is a very weak point for big code sizes, especially integer
code.

Additionally, another problem is that this Itanium2 regularly performs at
effectively 2 instructions per cycle. Sure, it seems to be able to do 4
adds within 1 cycle, but it shows real weak behaviour on integer codes there.
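
To make the IPC argument concrete, here is a minimal sketch of the arithmetic. IPC is retired instructions divided by cycles, and sustained throughput is IPC times clock. The function names, the instruction and cycle counts, and the two clock speeds below are hypothetical illustration values, not measurements of any real chip.

```python
def ipc(instructions_retired, cycles):
    """Instructions per cycle actually sustained by a run."""
    return instructions_retired / cycles


def instructions_per_second(ipc_value, clock_hz):
    """Sustained throughput: IPC multiplied by the clock frequency."""
    return ipc_value * clock_hz


# Hypothetical integer workload: both cpus sustain the same IPC of 2.0.
sustained = ipc(instructions_retired=4_000_000, cycles=2_000_000)

# Hypothetical clocks, chosen only to show the effect of clock speed.
low_clock = instructions_per_second(sustained, 1.5e9)
high_clock = instructions_per_second(sustained, 2.4e9)

# With equal IPC, the throughput ratio is simply the clock ratio.
ratio = high_clock / low_clock
print(sustained, ratio)
```

Under these assumed numbers, a wide in-order design that only sustains the same IPC as an out-of-order cpu gains nothing from its extra issue width, and the higher clocked part wins by the clock ratio.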

Its IPC there is effectively no better than that of the Opteron, which is an OOO

It's easy to look back and say that OOO simply wins because it's cheaper to
make such a processor, as you can put many cores onto 1 die with in
total like 1024KB L2, whereas Itanium2 is completely built around its huge

That means that if you put many cores onto 1 die, Itanium is real
weak, because it NEEDS that huge L3 and it NEEDS that fast L1.

Effectively that means it's pretty hard with current technology to make it
high clocked and quad core or even octo-core.

This while in January 2006 we will already see quad core Opterons, which will
be very nice for those who run integer workloads, and it's genius of
course for workloads that do all kinds of data structure work with only
now and then some floating point in between.

Itanium2 simply loses it there.

Montecito will be a major improvement in that respect, but simply too late.

Building a cluster from dual Opteron quad cores will become real interesting:
an all-round performer that performs very well and is real cheap per node.

The only luck that Itanium2 worshippers have is that not a single
manufacturer has yet produced a big SSI partition based upon its own
interconnects and a fast routing network from existing manufacturers.

If that existed, I doubt how vital IPF would still be.

From my viewpoint IPF looked like a genius solution a few years ago,
but it has a major problem when what you want to do is put a big bunch of
cores on 1 die.

Additionally, for programmers who want to optimize their code real well,
IA64 has the major disadvantage that it is nearly impossible to write inline
assembly for it. Just optimizing a small part of your code, because the
compiler isn't any smarter, is real ugly in IA64, as you have to take too
much into account when writing assembly for it.

So if you want the utmost performance out of it, x64 has in that respect won
the long-term battle against IA64.


>or can somehow be prefetch+streamed so that cache misses don't happen.
>once you start missing, performance becomes unexceptional - you can 
>easily see this by looking at SpecFP results.  there, the it2's excellent
>scores is mainly due to extremely high results in the 2-3 very smallest
>benchmark components.
>around here, it's mainly serial monte-carlo jobs that are so small that 
>they're always in-cache.  so the "high-end" it2 (and expensive) is best
>suited for the lowest-end jobs...
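
The superlinear speedup and the strong versus weak scaling distinction quoted in this thread can be sketched with a toy cost model. Everything here is an assumption for illustration: the function names, the 10x penalty for missing cache, the 64 MB problem size and the 8 bytes per item. Only the 9 MB cache figure is taken from the thread. The model also ignores all communication cost.

```python
def time_per_item(working_set_bytes, cache_bytes, t_cache=1.0, t_mem=10.0):
    """Toy cost per item: cheap if the per-cpu working set fits in cache."""
    return t_cache if working_set_bytes <= cache_bytes else t_mem


def run_time(items_per_cpu, bytes_per_item, cache_bytes):
    """Time for one cpu to process its share of the items."""
    working_set = items_per_cpu * bytes_per_item
    return items_per_cpu * time_per_item(working_set, cache_bytes)


def strong_scaling_speedup(total_items, ncpus, bytes_per_item, cache_bytes):
    """Fixed total problem, split evenly over ncpus."""
    t1 = run_time(total_items, bytes_per_item, cache_bytes)
    tn = run_time(total_items / ncpus, bytes_per_item, cache_bytes)
    return t1 / tn


def weak_scaling_speedup(items_per_cpu, ncpus, bytes_per_item, cache_bytes):
    """Fixed problem per cpu: the per-cpu working set never shrinks."""
    t1 = run_time(items_per_cpu, bytes_per_item, cache_bytes)
    tn = run_time(items_per_cpu, bytes_per_item, cache_bytes)
    return ncpus * t1 / tn


MB = 2 ** 20
CACHE = 9 * MB            # "something like 9 MB cache per processor"
ITEMS = 64 * MB // 8      # assumed 64 MB problem of 8-byte items

for n in (4, 8):
    print(n, strong_scaling_speedup(ITEMS, n, 8, CACHE),
          weak_scaling_speedup(ITEMS // 8, n, 8, CACHE))
```

In this model 4 cpus give a speedup of exactly 4, because each 16 MB share still misses cache, while 8 cpus give a speedup of 80, because each 8 MB share now fits. Under weak scaling the per-cpu working set is constant, so the scaled speedup never exceeds the cpu count.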