[Beowulf] itanium vs. x86-64

Mon Feb 9 17:18:19 PST 2009

On Feb 9, 2009, at 11:45 PM, Mark Hahn wrote:

>> I have been working on itanium machines for a year now and I  
>> actually found
>> the hw pretty elegant and the dev software stack on top of it  
>> (compiler,
>> profiler etc) pretty handy.
>

> aren't all the same tools available on x86_64?  or were you  
> referring to, eg,
> something SGI-specific?
>
>> but now with the Tukwila switching to the QuickPath, how do you  
>> guys think
>> Itanium will perform in comparison to Xeon's and Opteron's ?
>
> this change would be interesting if it meant that the next-gen  
> numalink
> box could take nehalems rather than ia64.  I can't really  
> understand why Intel has stuck with ia64 this long - perhaps the  
> economy will provide
> the fig-leaf necessary to dump it.
>

Of course there are a few points to address:

a) SMT works a lot better on in-order cores than HT does on
out-of-order cores.

      On out-of-order cores, prime number FFT (DWT actually) type
      software usually seems to profit about 5% from it.
      Branchy and memory-dependent codes seem, optimistically, able to
      get around 10-20% scaling improvement.

      The big difference is that on in-order cores, see Power6, you
      can basically hide the branch misprediction penalties,
      which is not possible on x64.

This was of course one of the many promises of in order cores.

b) Intel always ran behind in process technology and in clock speed
with Itanium and never managed to clock it very high. IBM didn't make
this mistake with Power6; they clocked it to 5 GHz. Now you can take
that for granted. You can also say: "heh, why am I paying $7500 for a
1.5 GHz Itanium 2 processor (at its release, and for quite some period
of time, it was priced like that), why has it been clocked that much
lower, and why is it process technologies behind the rest?"

You can say things like: "it needs more verification than cheapo x86
CPUs". However, that excuse is only valid for 6 months and no longer
than that.
After 6 months it HAS to be on the same process technology as the x86
CPUs. If x86 Xeon/Opteron type CPUs get produced in 65 nm, you can't
get away with 90 nm, let alone 130 nm parts. Tukwila is 65 nm or so?
All processors that are still to be released in 2009/2010 should be at
least 45 nm, as that's the standard now. There is no way a 4-core
65 nm processor can compete with a 45 nm x86 HPC CPU that has 8 cores.
Now one would have a good excuse when speaking about shared memory
machines, but realize latency plays a crucial role there.

The SGI Altix 3000 series basically has 280 ns latency (random 8-byte
reads from a big buffer) for shared memory up to 4 sockets.

So you can compare it with 4-socket machines very well.
Above 4 sockets SGI still has shared memory, but latency instantly
jumps to 700 ns, and up to 5000-7000 nanoseconds for 62 sockets.

700 ns is also the latency the 'glued together' 8-socket machines had
in those days, having 8 P4 Xeon MPs.
Reality is simply that the majority of parallel software doesn't work
well with such latencies/bandwidth.
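
For what "random 8 byte reads from a big buffer" means as a test, here
is a minimal sketch of the usual pointer-chasing pattern. The names
build_chain and chase are made up for this illustration. It only shows
the access pattern: in Python the interpreter overhead dominates the
timing, so real latency numbers need the same loop in C or assembler
and a buffer far bigger than the caches.

```python
import random
import time
from array import array


def build_chain(n, seed=0):
    """Return a random permutation of range(n) that forms one single
    cycle (Sattolo's algorithm), stored as 8-byte integers."""
    rng = random.Random(seed)
    nxt = array("q", range(n))          # 'q' = signed 64-bit entries
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)            # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Do `steps` dependent reads; return (final index, seconds/read)."""
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]                  # next address depends on this read
    elapsed = time.perf_counter() - t0
    return idx, elapsed / steps
```

Because every read depends on the previous one, the hardware cannot
overlap them, so time per read approximates latency rather than
bandwidth.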

c) A 4-core Tukwila at 2.4 GHz or so? It is of course a factor 4 or so
slower than the upcoming 8-core Xeon MP at 3.4 GHz or so (just
guessing) for integer codes.

d) The Itanium 2 core was just simply a bad performer thanks to a very
silly design. No instructions in the L2 cache, yet it is equipped with
an ultra tiny L1 instruction cache of 32KB. This while the instruction
size is huge and it uses bundles of 2 x 3 instructions, so 6 in total.
Additionally, it is those instruction bundles that had to replace
branch mispredicts in a clever manner. So that means you really need
to execute a LOT of code and therefore have a big need for a HUGE L1i
cache. Opteron and Core 2/i7 totally own Itanium 2 there in many
software programs.

e) Itanium could only be bought from companies that really overcharged
for the hardware. Now we can forget about the power such parts eat, as
power is really cheap for big companies that use a lot of it. Up to a
factor 20 cheaper than what you pay at home (and if your government
building is paying a 'normal' price, then now is the time to start
negotiating there).

f) Itanium 2 focused totally upon floating point, yet that is about
the least interesting thing for such hardware to do; there have always
been cheaper floating point solutions than Itanium 2. Let's forget
about the disaster called Itanium 1. A wrong focus. Integer speed
simply matters for such expensive CPUs. This is why IBM could sell
Power6.

g) Itanium 2 had just 2 integer units, this while the total bundle
width is 6. Power6 makes the same mistake if you ask me, yet at least
it can make up for some of that mistake with a 5 GHz frequency. The
x86 CPUs totally finish off both processors there in a skilled manner
for many software programs.

h) I ran a few things on a brand-new 1.3 GHz Itanium 2 at the time and
was amazed that a 2.2 GHz Opteron (already existing for months) was a
factor 3 faster for some simplistic programs like random number
generators.

It simply appeared that the instruction set of Itanium 2 was missing a
lot of simple instructions. Rotate, for example. Now excuse me,
cryptographically speaking I'm no big hero, as frankly I know nothing
about cryptography. Yet some instructions get used EVERYWHERE. Now
we'll excuse it for not having a division instruction.
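
To show what a missing rotate costs, here is a minimal sketch of a
64-bit rotate built from the operations that are left: two shifts and
an OR. The names rotl64 and rotr64 are made up for this illustration,
and the mask is only there because Python integers are unbounded. It
is not the exact sequence any particular compiler emits.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, r):
    """Rotate the 64-bit value x left by r bits using shifts and OR."""
    r &= 63                                  # rotation count modulo 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def rotr64(x, r):
    """Rotate right, expressed as a left rotate by the complement."""
    return rotl64(x, (64 - r) & 63)
```

With a real rotate instruction this is one operation; without it every
rotate in a hash or random number generator becomes three.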

Now what most scientists do not realize about the FFT, something most
of them use even if they don't know the theory behind it (who am I to
claim so), is that in floating point it gives nonstop rounding errors.
In short, integer FFTs are really important to prove something
correct. In fact those would run faster if hardware would offer more
support. So just make 2 integer multiplication units and soon all FFTs
get rewritten to integer and run faster and with LESS error. In fact
with no error at all. So there is no chance of having to backtrack
after an error at all. That worst case, which could happen every time
in floating point, is simply not there in integers.
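
As an illustration of an integer FFT, here is a minimal sketch of a
number theoretic transform, which does the same butterflies as an FFT
but with integers modulo a prime. The prime 998244353 with primitive
root 3 is just a common textbook choice, and the names ntt and
convolve_exact are made up here. The result is exact only as long as
the true convolution coefficients stay below the prime.

```python
P = 998244353          # prime, P - 1 = 119 * 2**23
G = 3                  # primitive root modulo P


def ntt(a, invert=False):
    """In-place radix-2 transform over the integers mod P.
    len(a) must be a power of two, at most 2**23."""
    n = len(a)
    j = 0
    for i in range(1, n):               # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)   # root of unity of this order
        if invert:
            w_len = pow(w_len, P - 2, P)       # modular inverse (Fermat)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % P
                a[k] = (u + v) % P
                a[k + half] = (u - v) % P
                w = w * w_len % P
        length *= 2
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P
    return a


def convolve_exact(x, y):
    """Convolution of two integer lists with no rounding anywhere."""
    out_len = len(x) + len(y) - 1
    size = 1
    while size < out_len:
        size *= 2
    fx = ntt(x + [0] * (size - len(x)))
    fy = ntt(y + [0] * (size - len(y)))
    prod = [p * q % P for p, q in zip(fx, fy)]
    return ntt(prod, invert=True)[:out_len]
```

For example, convolve_exact([1, 2, 3], [4, 5, 6]) gives exactly
[4, 13, 28, 27, 18]. Note that the inner loop is nothing but integer
multiplies and modular reductions, which is where the multiplication
units matter.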

Yet Itanium 2 does not have much support for 64-bit integers. It
doesn't have an instruction to multiply a * b and get the precise
128-bit answer back.
To do that you have to simulate it with floating point instructions.
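
To give an idea of the extra work when no widening multiply exists,
here is a minimal sketch of the generic schoolbook way: split both
operands in 32-bit halves and combine four partial products. This is
not the Itanium floating point sequence, just the general
decomposition, and the name mul64x64_128 is made up for this
illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul64x64_128(a, b):
    """Return (hi, lo) of the unsigned 128-bit product a * b, using
    only partial products that each fit in 64 bits."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    p0 = a_lo * b_lo                    # weight 2**0
    p1 = a_lo * b_hi                    # weight 2**32
    p2 = a_hi * b_lo                    # weight 2**32
    p3 = a_hi * b_hi                    # weight 2**64

    # Middle column: sum of three values below 2**32, fits in 64 bits.
    mid = (p0 >> 32) + (p1 & MASK32) + (p2 & MASK32)
    lo = ((mid & MASK32) << 32) | (p0 & MASK32)
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)
    return hi & MASK64, lo
```

That is four multiplies plus a handful of shifts and adds for
something x86-64 does with a single MUL instruction.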

That sucks, let's use polite wording here.

Of course when buying a machine, no one *gets the idea* to test this.

A more general comment on the above:

You can DEFINITELY blame most HPC guys for focusing too much upon
floating point. I get the impression they don't realize that for an
FFT you can pack FEWER bits in a double precision floating point
number than in a 64-bit integer, and that there is no error in
integers.
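
A quick way to see the gap, as a minimal sketch assuming IEEE 754
doubles (which is what Python's float is on common hardware): a double
carries 53 bits of mantissa, so integers above 2**53 no longer all
survive a round trip, while a 64-bit integer holds all 64 bits. The
helper name survives_double is made up for this illustration.

```python
import sys

MANT_BITS = sys.float_info.mant_dig     # mantissa bits of a double


def survives_double(n):
    """True if the integer n converts to a double and back unchanged."""
    return int(float(n)) == n
```

Here survives_double(2**53) is true but survives_double(2**53 + 1) is
false, and in an FFT you lose even more of those 53 bits to the
accumulated rounding error.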

These HPC guys are really good at creating their own problems. Itanium
is a clear demonstration of how HPC guys mess up a lot
and are nerds who are easy to manipulate for some clever marketing guys.

So I guess Intel is doing what you can expect from a hardware company.
For years it tries to get away with the cheapest possible CPU design
ever for HPC, and grabs money out of the market with it, until the
last professor who still believes the propaganda of the nineties and
the start of the 21st century has retired from the hardware commission
and gets replaced by a ruthless commissioner who awards the contract
in an open bid without stupid nerd demands (demands that already hand
the contract in advance to just 1 company, which then can ask any
price of course).

Everything relevant has already been running on x86 CPUs for years by
now; even ISPs and telecommunications do by now.
That will only increase. With respect to HPC, Itanium never was faster
than x86 processors; that is the problem.
In floating point, of course, for the price of 1 Itanium quad-socket
node you could get a sports hall filled with x64s,
and in integers the x86 CPUs were just a lot faster. Intel has fooled
everyone there with good marketing.

i) See what is happening now. There are quite some rumours about
Tukwila; in the meantime, of course, the 8-core Xeon MP is going to
kick butt and get good sales, AND THEREFORE OF COURSE PRIORITY AT
INTEL EVERYWHERE.

This is something you can explain to a nerd forever, and he still  
will not understand it. Nor will the majority of this list.

If intel has a bunch of cpu's:
     i7's / Nehalems versus Itanium versus Larrabee

Guess what has the LOWEST priority then?

It is obvious that Itanium never got priority internally at Intel. It
is questionable whether something like a Larrabee version for HPC
is a good idea to give serious attention, knowing that the Xeon
platform in the end will get a bigger priority anyway. There is so
much more money at stake in the Xeon platform.

Want support for your Itanium CPU? Be glad if you get past the
outsourced first line of support at all.
X questions, and at every question the reply is:
    "Yes, yes, yes, yes, yes, yes, yes".
But never the answer you want to have.

That is very bad for HPC guys who are used to service and have big
technological demands.

j) For years HPC used 64-bit processors and x86 was 32 bits. That has
changed, however. Though 32-bit code still runs a lot faster, and
should run faster, the tiny processors are 64 bits now also. So that
alone already loses a lot of the market to the always cheaper x64
processors.

k) Everything is massively parallel nowadays, so for a lot of
different software it doesn't matter whether you use 1 big fast
processor or 10 cheap low-power processors which together are just as
fast; this under the condition that your network somehow solves the
bandwidth problem, and provided that it is cheaper to produce those 10
tiny processors than 1 big fast processor. In short, for really a lot
of software there is now always the option to use tiny processors in a
big cluster. As long as this software scales somehow.

l) The biggest problem by far of Itanium is, will be and was that you
don't let an F1 driver toy with just 2 bicycles;
he needs an F1 car to drive in.

If it cannot compete on price with an x86 processor, any special HPC
processor has to face the hard condition that it has to be a lot faster.

Vincent

> (why am I down on ia64?  mainly the sense of unfulfilled promise:  
> the ISA was
> supposed to provide some real advantage, and afaikt never has.  the  
> VLIW-ie
> ISA was intended to avoid clock scaling problems created by CISC  
> decode and
> OOO, no?  but the ia64 seems to have only distinguished itself by  
> relatively
> large caches and offering cache-coherency hooks to SGI.  have other  
> people had the experience ia64 doing OK on code with regular/ 
> unrollable/prefetchable data patterns, but poorly otherwise?)
>
> regards, mark hahn.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf