[Beowulf] Xeon Phi out as well [kraut]
Vincent Diepeveen
diep at xs4all.nl
Tue Nov 13 05:58:58 PST 2012
Interesting article.
Regrettably the writer is a technical noob, which is clearly visible
in the German he writes. He confuses MB with GB, so it's not clear how
accurate the rest of what he writes is. Well, what can you expect from
Heise.de in that sense...
Let's assume that the majority of what he wrote down is OK.
Then we are talking about 60 cores at 1.053 GHz using 512-bit vectors,
so that's 8 doubles per vector, I assume, or AVX2.
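To make the vector-width arithmetic explicit, here is a tiny Python sketch. The helper name `lanes` is just mine for illustration; all it does is divide the 512-bit register width by the element width.

```python
VECTOR_BITS = 512  # vector register width given in the article


def lanes(element_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """How many elements of `element_bits` fit in one vector register."""
    if element_bits <= 0 or vector_bits % element_bits:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


print("doubles per vector:", lanes(64))
print("floats per vector: ", lanes(32))
```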
The horror architecture previously called Larrabee.
Just more cores. I read nothing about cache coherency anymore, and the
fact that they can 'turn off' 2 cores suggests it might not have it.
If so, it no longer has the bottleneck that Larrabee had.
According to this article, you have to run 4 threads on it
simultaneously. That's a factor 2 more than today's top GPUs need: on
both AMD and Nvidia you can perform well running 2 'threads' 'at the
same time' (they get alternated).
I assume that's for the same reason, namely to hide the latency between
the execution units executing an instruction and its result being
released.
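A toy model of why several threads per core are needed. This is a simplified sketch under my own assumptions, not Intel's documented pipeline: every instruction depends on the previous one from the same thread, a result becomes usable `latency` cycles after issue, and the core issues at most one instruction per cycle, picking ready threads round-robin. The 4-cycle latency used below is hypothetical, chosen only to match the 4-thread figure from the article.

```python
def issue_utilization(threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the core issues an instruction.

    Hypothetical model: a thread may issue again only `latency` cycles
    after its previous instruction, because it needs that result.
    """
    if threads < 1 or latency < 1:
        raise ValueError("threads and latency must be >= 1")
    ready_at = [0] * threads  # cycle at which each thread may issue next
    next_thread = 0           # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        # Scan threads round-robin and issue from the first ready one.
        for offset in range(threads):
            t = (next_thread + offset) % threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % threads
                break
    return issued / cycles


for n in (1, 2, 4):
    print(n, "thread(s):", issue_utilization(n, latency=4))
```

With a 4-cycle latency in this model, 1, 2 and 4 threads keep the issue slot busy 25%, 50% and 100% of the time, which is the kind of reasoning behind "run 4 threads".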
From Larrabee we knew that some instructions that are pretty important
for HPC had poor throughput, eating several cycles each. So it's
difficult to calculate now what is possible to achieve.
Let's now assume 1 instruction can be executed and retired each clock
cycle.
This is a dangerous assumption, as historically Intel hasn't had very
good multiply execution units on any of its architectures when compared
to competitors. Historically, the latency of 64-bit integer
multiplication on their x86/x64 CPUs was also nearly a factor 2 worse
than on, for example, AMD's Opterons. The latest i7 should have improved
there, though.
Under this assumption, a throughput of 1 instruction per clock (even
though multiply-add has a latency of several clocks), that gives us:
1.053 GHz * 60 cores * 8 doubles = 505.44 Gflop
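The same peak estimate as a small Python function, so the inputs can be varied. It assumes, exactly as stated above, that one vector instruction retires per core per clock; `peak_gflops` and `flops_per_lane` are names I made up for this sketch.

```python
def peak_gflops(clock_ghz: float, cores: int, lanes: int,
                flops_per_lane: int = 1) -> float:
    """Peak Gflop/s assuming one vector instruction retires per core per clock."""
    return clock_ghz * cores * lanes * flops_per_lane


# Figures from the Heise article; one flop per lane, i.e. no multiply-add.
print(f"{peak_gflops(1.053, 60, 8):.2f} Gflop")
```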
Everyone always "lies" by adding that factor 2 for multiply-add, even
though I bet no one will manage to push those through nonstop at 1
instruction per cycle. Also, the big transforms based on Fourier
transforms cannot use multiply-add at all. Yet if we ignore that, like
everyone ignores it, that gives bragging rights of
2 * 505.44 Gflop = 1.01088 Tflop
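A rough way to see why the doubled figure is optimistic: under the same one-instruction-per-clock assumption, a multiply-add counts as 2 flops per lane and any other arithmetic instruction as 1. If a fraction `fma_fraction` of the vector arithmetic instructions are multiply-adds (a hypothetical parameter; real codes vary), the attainable peak scales by `1 + fma_fraction`. This Python sketch ignores memory bandwidth and the throughput doubts above.

```python
BASE_GFLOPS = 1.053 * 60 * 8  # non-multiply-add peak: GHz * cores * lanes


def mixed_peak_gflops(fma_fraction: float) -> float:
    """Peak when `fma_fraction` of vector arithmetic instructions are FMAs.

    Each FMA counts as 2 flops per lane, every other instruction as 1.
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be in [0, 1]")
    return BASE_GFLOPS * (1.0 + fma_fraction)


for f in (0.0, 0.5, 1.0):
    print(f"{f:.0%} FMA -> {mixed_peak_gflops(f):.2f} Gflop")
```

A fraction of 0 (the FFT case as described above) gives back 505.44 Gflop; only a fraction of 1 reaches the 1.01088 Tflop bragging number.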
This isn't bad at all considering that the K20, which by a Moore's Law
deduction (doubling the transistors doubles the speed) would have landed
near 2 Tflop, appears to be just above 1.0 Tflop right now.
The fear was of course that the latest Larrabee incarnation, Xeon Phi,
would cost $10k, yet it seems Intel wants to conquer the HPC market, and
Heise gives a price for it, the first time I've seen one: $2649.
It's only available in 2013, though, which is a disadvantage.
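To compare chips on price, the usual metric is dollars per peak Gflop. A minimal Python sketch using only numbers from this post ($2649 announced, $10k feared, 505.44 Gflop without and 1010.88 Gflop with multiply-add counted); `dollars_per_gflop` is a hypothetical helper, and it ignores power, host systems and real achieved performance.

```python
def dollars_per_gflop(price_usd: float, peak_gflops: float) -> float:
    """Purchase price divided by theoretical peak Gflop/s."""
    if peak_gflops <= 0:
        raise ValueError("peak must be positive")
    return price_usd / peak_gflops


PEAK_NO_FMA = 505.44  # Gflop, one flop per lane per clock
PEAK_FMA = 1010.88    # Gflop, multiply-add counted as 2 flops

for label, price in (("announced", 2649.0), ("feared", 10_000.0)):
    print(f"{label:9s} ${price:>8,.0f}: "
          f"{dollars_per_gflop(price, PEAK_NO_FMA):.2f} $/Gflop (no FMA), "
          f"{dollars_per_gflop(price, PEAK_FMA):.2f} $/Gflop (FMA)")
```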
Of course, be careful buying this chip if you don't know what AVX2 is.
Many tried to write code for AVX2, and it took them years to get some
prime number transforms working even somewhat on it.
We see that Intel has deviated from its original plan, yet it still
tells reporters the nonsense story that it would be interesting to run
Pentium code on it. A single i7 will of course beat it there, because to
reach maximum throughput you need to put your data inside vectors of 8
doubles; otherwise it will perform horribly.
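The penalty for not filling the vectors is easy to put a number on under the same simple peak model: if only `active_lanes` of the 8 double-precision lanes carry useful data, the attainable peak drops proportionally. A Python sketch with a hypothetical helper name; no multiply-add and no memory effects are modelled.

```python
LANES = 8                     # doubles per 512-bit vector
PEAK_GFLOPS = 1.053 * 60 * 8  # all 60 cores, all lanes busy, no multiply-add


def lane_limited_gflops(active_lanes: int) -> float:
    """Peak when only `active_lanes` of the vector lanes carry useful data."""
    if not 1 <= active_lanes <= LANES:
        raise ValueError(f"active_lanes must be between 1 and {LANES}")
    return PEAK_GFLOPS * active_lanes / LANES


for n in (1, 4, 8):
    print(f"{n} lane(s): {lane_limited_gflops(n):.2f} Gflop")
```

Purely scalar code (1 lane) is capped at about 63 Gflop in this model, an eighth of the vector peak.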
Assuming the Larrabee instruction set survived, it is also possible
to indirectly access each core using special instructions. On Larrabee,
however, those had a 7-cycle latency, so they are not very encouraging
to use.
So doing what you can do reasonably simply on GPUs is pretty difficult
here, though not impossible.
Of course, the only bummer is that it's not yet available.
From a marketing viewpoint it is a good idea for Intel to announce it
now, as otherwise everyone would already sign a deal with Nvidia. Yet we
know from some years ago how Intel got several HPC organisations into
big trouble simply by not delivering the Itanium2 CPUs at the appointed
time. That took another 6 months to a year. As they all talk with each
other, I am not sure what the impact of this will be.
It's obvious, however, that Intel wants to compete right now by pricing
the chip moderately. That's good for the HPC community.
Now let's hope that none of the manufacturers gets a total monopoly;
otherwise we'll be paying the $7500 that the Itanium2 1.5 GHz cost at
introduction.
Financially, these manufacturers can easily offer these CPUs for
$1500 to $2k, as that easily pays back all production and development
costs.
On Nov 13, 2012, at 1:40 PM, Eugen Leitl wrote:
>
> http://www.heise.de/newsticker/meldung/SC12-Intel-bringt-Coprozessor-Xeon-Phi-offiziell-heraus-1747942.html
>
> http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FSC12-Intel-bringt-Coprozessor-Xeon-Phi-offiziell-heraus-1747942.html&act=url
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf