[Beowulf] itanium vs. x86-64
Vincent Diepeveen
diep at xs4all.nl
Mon Feb 9 17:18:19 PST 2009
On Feb 9, 2009, at 11:45 PM, Mark Hahn wrote:
>> I have been working on itanium machines for a year now and I
>> actually found
>> the hw pretty elegant and the dev software stack on top of it
>> (compiler,
>> profiler etc) pretty handy.
>
> aren't all the same tools available on x86_64? or were you
> referring to, eg,
> something SGI-specific?
>
>> but now with the Tukwila switching to the QuickPath, how do you
>> guys think
>> Itanium will perform in comparison to Xeon's and Opteron's ?
>
> this change would be interesting if it meant that the next-gen
> numalink
> box could take nehalems rather than ia64. I can't really
> understand why Intel has stuck with ia64 this long - perhaps the
> economy will provide
> the fig-leaf necessary to dump it.
>
Of course there are a few points to address:
a) SMT works a lot better on in-order cores than HT does on
out-of-order cores. On out-of-order cores, prime-number FFT (DWT,
actually) type software seems to profit by about 5% there.
Branchy and memory-dependent codes optimistically seem able to get
around a 10-20% scaling improvement.
The big difference is that on in-order cores (see POWER6) you can
basically hide the branch-misprediction penalties, which is not
possible on x64.
This was of course one of the many promises of in-order cores.
b) Intel always ran behind in process technology with Itanium and
never managed to clock it very high. IBM didn't make this mistake
with POWER6; they clocked it at 5 GHz. Intel always ran behind in
clockspeed with Itanium. Now you can take that for granted. You can
also say: "heh, why am I paying $7500 for a 1.5 GHz Itanium 2
processor (at its release, and for quite some period of time, it was
priced like that), why is it clocked that much lower, and why is it
process technologies behind the rest?"
You can say things like: "it needs more verification than cheapo x86
CPUs". However, that excuse is only valid for 6 months and no longer
than that.
After 6 months it HAS to be on the same process technology as x86
CPUs. If Xeon/Opteron-type x86 CPUs get produced at 65 nm, you can't
get away with 90 nm, let alone 130 nm parts. Tukwila is 65 nm or so?
All processors still to release in 2009/2010 should be at least
45 nm, as that's the standard now. There is no way a 4-core 65 nm
processor can compete with a 45 nm x86 HPC CPU that has 8 cores.
Now one would have a good excuse when speaking about shared-memory
machines, but realize that latency plays a crucial role there.
The SGI Altix 3000 series is basically 280 ns (random 8-byte reads
from a big buffer) shared memory up to 4 sockets.
So you can compare it with 4-socket machines very well.
Above 4 sockets SGI still has shared memory, but latency drops
instantly to 700 ns, up to 5000-7000 nanoseconds at 62 sockets.
700 ns is also the latency the 'glued together' 8-socket machines of
those days had with 8 P4 Xeon MPs.
Reality is simply that the majority of parallel software doesn't work
well with such latencies/bandwidth.
c) A 4-core Tukwila at 2.4 GHz or so? It is of course roughly a
factor 4 slower than the upcoming 8-core Xeon MP at 3.4 GHz or so
(just guessing) for integer codes.
d) The Itanium 2 core was simply a bad performer thanks to a very
silly design: no instructions in the L2 cache, yet an ultra-tiny L1
instruction cache of 32 KB. This while the instruction size is huge
and it uses bundles of 2 x 3 instructions, so 6 in total.
Additionally, it is those instruction bundles that had to replace
branch mispredicts in a clever manner. That means you really need to
execute a LOT of code and therefore have a big need for a HUGE L1i
cache. Opteron and Core 2/i7 totally own Itanium 2 there in many
software programs.
e) Itanium could only be bought from companies that really
overcharged for the hardware. Now we can forget about the power such
parts eat, as power is really cheap for big companies that consume a
lot: up to a factor 20 less than what you pay at home (and if your
government building is paying a 'normal' price, then now is the time
to start negotiating there).
f) Itanium 2 totally focused on floating point, yet that is about the
least interesting thing for such hardware to do; there have always
been cheaper floating-point solutions than Itanium 2. Let's forget
about the disaster called Itanium 1. A wrong focus. Integer speed
simply matters for such expensive CPUs. This is why IBM could sell
POWER6.
g) Itanium 2 had just 2 integer units, this while the total bundle is
6. POWER6 makes the same mistake if you ask me, yet at least it can
make up some of that mistake with a 5 GHz frequency. The x86 CPUs
totally finish off both processors there in a skilled manner for many
software programs.
h) I ran a few things on a brand-new 1.3 GHz Itanium 2 at the time
and was amazed that a 2.2 GHz Opteron (already out for months) was a
factor 3 faster for some simplistic programs like random-number
generators. It appeared that the Itanium 2 instruction set was simply
missing a lot of simple instructions. Rotate, for example. Now excuse
me, cryptographically speaking I'm no big hero, as frankly I know
nothing about cryptography, yet some instructions get used
EVERYWHERE. Not having a division instruction we'll excuse it for.
Now what most scientists do not realize is that the FFT, something
most of them use even if they don't know the theory behind it (who am
I to claim so), gives nonstop rounding errors in floating point. In
short, integer FFTs are really important to prove something correct.
In fact those would run faster if hardware offered more support. Just
add 2 integer multiplication units and soon all FFTs get rewritten to
integer and run faster and with FEWER errors. In fact no error at
all, so no need for error backtracking at all. That worst case that
can happen every time in floating point is simply not there with
integers.
Yet Itanium 2 does not have much support for 64-bit integers. It
doesn't have an instruction to multiply a * b and get the 128-bit
precise answer back.
To do that you have to simulate it with floating-point instructions.
That sucks, to use polite wording here.
Of course when buying a machine, no one *gets the idea* to test this.
A more general comment on the above:
You can DEFINITELY blame most HPC guys for focusing too much on
floating point. I get the impression they don't realize that a
double-precision float can pack FEWER bits than a 64-bit integer for
an FFT, and that there is no error in integers.
These HPC guys are really good at creating their own problems.
Itanium is a clear demonstration of how HPC guys mess up a lot and
how easily nerds get manipulated by some clever marketing guys.
So I guess Intel is doing what you can expect from a hardware
company. For years it has tried to get away with the cheapest
possible CPU design ever for HPC, grabbing money out of the market
with it, until the last professor who still believes the 1990s and
early-21st-century propaganda has retired from the hardware
commission and gets replaced by a ruthless commissioner who gives the
contract to a company in an open bid, without stupid nerd demands
(which hand the contract job in advance to just 1 company, which can
then ask any price, of course).
Everything relevant has already run on x86 CPUs for years by now;
even ISPs and telecommunications do by now.
That will only increase. With respect to HPC, Itanium never was
faster than x86 processors; that is the problem.
In floating point, for the price of 1 Itanium quad-socket node you
could of course fill a sports hall with x64s, and in integer codes
the x86 CPUs were just a lot faster. Intel has fooled everyone there
with good marketing.
i) See what is happening now. There is quite some rumour about
Tukwila; in the meantime, of course, the 8-core Xeon MP is going to
kick butt and get good sales, AND THEREFORE OF COURSE GET PRIORITY AT
INTEL EVERYWHERE.
This is something you can explain to a nerd forever, and he still
will not understand it. Nor will the majority of this list.
If Intel has a bunch of CPUs:
i7s/Nehalems versus Itanium versus Larrabee
Guess which has the LOWEST priority?
It is obvious that Itanium never got priority internally at Intel. It
is questionable whether something like a Larrabee version for HPC is
a good idea to give serious attention, knowing that the Xeon platform
will get bigger priority in the end anyway. There is so much more
money at stake in the Xeon platform.
Want support for your Itanium CPU? Be glad if the 12-year-old nephew
of the engineer stationed in Bangalore (India) helps you.
"Are you sure your question is proper, mista?"
X questions, and at every question:
"Yes, yes, yes, yes, yes, yes, yes."
But never the answer you want to have.
That is very bad for HPC guys who are used to service and have big
technological demands.
j) For years HPC used 64-bit processors while x86 was 32-bit. That
has changed, however. Though 32-bit code still runs a lot faster and
should run faster, the tiny processors are 64-bit now as well. So
Itanium already loses a lot of the market to the always-cheaper x64
processors.
k) Everything is massively parallel nowadays, so for a lot of
different software it doesn't matter whether you use 1 big fast
processor or 10 cheap low-power processors that together are just as
fast; this under the condition that your network somehow solves the
bandwidth problem, and provided it is cheaper to produce those 10
tiny processors than 1 big fast processor. In short, for really a lot
of software there is now always the option of using tiny processors
in a big cluster, as long as that software somehow scales.
l) The biggest problem of Itanium by far is, will be, and was: you
don't let an F1 driver toy with just 2 bicycles; he needs an F1 car
to drive.
If it cannot compete on price with an x86 processor, any special HPC
processor has to face the hard condition that it has to be a lot
faster.
Vincent
> (why am I down on ia64? mainly the sense of unfulfilled promise:
> the ISA was
> supposed to provide some real advantage, and afaikt never has. the
> VLIW-ie
> ISA was intended to avoid clock scaling problems created by CISC
> decode and
> OOO, no? but the ia64 seems to have only distinguished itself by
> relatively
> large caches and offering cache-coherency hooks to SGI. have other
> people had the experience ia64 doing OK on code with regular/
> unrollable/prefetchable data patterns, but poorly otherwise?)
>
> regards, mark hahn.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf