[Beowulf] Three notes from ISC 2006

Vincent Diepeveen diep at xs4all.nl
Wed Jun 28 17:07:06 PDT 2006


The team HAD a system nearly a month ago, at the 2006 world computer chess
championship.

It was clocked at 3GHz and had 2 sockets (4 cores in total).
3GHz is 25% more than the 2.4GHz dual core opterons i've got here,
and they get a 20% higher IPC.

So effectively it's 50% faster.
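
For anyone who wants to check that 50% figure: it is just the clock ratio
multiplied by the IPC ratio. A minimal sketch in Python, assuming that
performance scales linearly with both clock and IPC (a simplification); the
function name relative_speed is made up for illustration:

```python
def relative_speed(clock_a_ghz, clock_b_ghz, ipc_gain):
    """Speed of chip A relative to chip B.

    Assumes throughput = clock * IPC, so the ratio is the clock ratio
    times the IPC ratio. ipc_gain is A's IPC advantage as a fraction.
    """
    return (clock_a_ghz / clock_b_ghz) * (1.0 + ipc_gain)


# 3GHz woodcrest vs 2.4GHz opteron, with the reported 20% higher IPC
speedup = relative_speed(3.0, 2.4, 0.20)
print(f"clock ratio : {3.0 / 2.4:.2f}")
print(f"total ratio : {speedup:.2f}")
print(f"advantage   : {(speedup - 1.0) * 100:.0f}%")
```

So 1.25 times 1.20 gives 1.5, which is where "50% faster" comes from.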

Now we already knew from some floating point benchmarks that are not so
dependent upon a huge L3 cache (which most specfp software somehow is) that
it pretty much destroyed the competition, but chess is completely integer
only and has a lot of branches that nonstop get mispredicted.

So getting a 20% higher IPC there than k8 is quite *impressive*.

Of course it's not clear how much of that 20% the DDR2 contributes.

As it has the same latency, just more bandwidth, according to what i read
online, i'd say at most 2% of that is RAM.

The team using it is the junior team: Amir Ban and Shay Bushinsky. The last
is the programmer, the first is the former foreman of www.m-systems.com

You can find them in Tel Aviv somewhere.

The majority of systems in the financial world that run databases and
serve requests are 2 socket machines, not 4 socket ones.

It's very fair to compare woodcrest to k8, because the next generation chip
from AMD is K8L, and as that must use 0.065 micron technology it will under
normal circumstances take till 2008 or so to get sold in shops, whereas
woodcrest can be ordered today from Dell (and hopefully soon gets
delivered).

Basically that K8L chip must first tape out, then it takes another year to
produce it and get it in the shops. That's how it normally works.
So January 2008 would already be good.

Of course i hope AMD proves us wrong there!

But the combination of new process technology + moving from 3 to 4
instructions a cycle will of course give massive problems and headaches to
AMD, especially given the years of delay it took to introduce the previous
technology (0.09 micron) when it was new.

In short, AMD will have to release some quad core k8 by the end of this year
AND clock it at 3GHz to be able to compete with woodcrest.

Of course adding 2 more cores to k8 is simpler for AMD than designing a new
core that executes 4 instructions a cycle.

Right now what AMD is doing is simply putting in more watts. 125 watt is
just over the top IMHO.
That's even more than what intel did to the Xeon P4 when it had failed (105
watt) and similar to itanium (also 125 watt).

The dual socket, dual core 2.4GHz opteron box here is already nearly
uncoolable.

Not to mention what happens when i've got my beowulf online here with 14-16 
nodes!

Woodcrest has 3 floating point/SSE2 units, so it can effectively execute 4
instructions a cycle (as about every other instruction has a memory operand,
and amazingly that doesn't get counted as a flop but in reality MUST get
executed), whereas some improved k8 at the end of this year will do 3
instructions a cycle.

And we haven't even talked about what happens to k8 when it executes a
vector path instruction, which basically blocks all its execution units and
effectively eats some 7 to 8 cycles. Multiplying takes 4 cycles, which is
very important for matrix calculations, and that's very nice on opteron of
course.

The IPC difference alone (4 versus 3 instructions a cycle) is a 33%
advantage for woodcrest at this moment, and woodcrest is not more expensive
than the 265-285 series from AMD.
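
The 33% is the ratio of peak instructions per cycle, 4 against 3. A rough
Python sketch of that, set next to the 20% IPC gain the junior team reported;
it assumes peak issue width is an upper bound that real code only partly
reaches, and the names below are made up for illustration:

```python
def issue_width_advantage(wide, narrow):
    """Fractional advantage of a core issuing `wide` instructions a cycle
    over one issuing `narrow`, if both ran at their peak rate."""
    return wide / narrow - 1.0


peak = issue_width_advantage(4, 3)  # woodcrest vs k8, peak rates only
reported = 0.20                     # IPC gain reported by the junior team

print(f"peak advantage     : {peak * 100:.0f}%")
print(f"reported advantage : {reported * 100:.0f}%")
print(f"share of peak seen : {reported / peak * 100:.0f}%")
```

The gap between the two numbers is the room that a better compiler could in
principle still fill.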

Then additionally intel has the major advantage of having a good compiler 
for it, which is really the big killer in performance.

I haven't even compiled Diep with intel c++ yet to test on woodcrest. What
speed will it get then?

More or less than a 20% IPC difference to AMD?
I'm counting on *more*.

The potential advantage of woodcrest is therefore 33% for software like
ours, which hardly multiplies and spends 99.9% of the time executing integer
code. AMD wins some of that back because branches run faster on it, i guess.

It seems that certain mispredicted branches on AMD eat just 5 cycles or so.

Vincent

----- Original Message ----- 
From: "Craig Tierney" <ctierney at hypermall.net>
To: "Vincent Diepeveen" <diep at xs4all.nl>
Cc: "Kevin Ball" <kball at pathscale.com>; "Erik Paulson" 
<epaulson at cs.wisc.edu>; <beowulf at beowulf.org>; "Patrick Geoffray" 
<patrick at myri.com>
Sent: Wednesday, June 28, 2006 11:44 PM
Subject: Re: [Beowulf] Three notes from ISC 2006


> Vincent Diepeveen wrote:
>> Woodcrest totally destroys everything in terms of raw cpu performance.
>>
>> Not only it clocks nearly 25% higher. According to junior team who used 
>> such a system
>> from HP (that's their normal sponsor) at world champs 2006 it was giving 
>> a 20% higher ipc for their program too.
>
> What do you mean 'clock nearly 25% higher'.  Higher than what?  I had
> the impression that the top clock rates of the Woodcrest are dropping 
> slightly from the Dempsey numbers.  The machines were just announced, the 
> team you refer to has a system already?
>
> Can you describe what program they were running?  What is FP intensive?
> Woodcrest has added an additional 128-bit SSE2 register, but no additional 
> memory bandwidth to support it.  Double the FP performance is nice, but in 
> practice I wonder if I will see even a 5% increase for my codes.  The 
> change is nice for linpack (doubles node performance), but even the 
> efficiency drops somewhat.
>
>>
>> That's 50% faster than 2.4Ghz dual core opteron.
>>
>> Only for those who need latency to the RAM above cpu performance,
>> A64-single core with 16GB RAM at each node will be more interesting.
>>
>> That's not many applications.
>>
>> Of course if you buy something *today* the dual core opteron is the 
>> preferred node,
>> as woodcrest isn't in the shop yet buyable.
>>
>> If your software can work with gigabit ethernet then of course the price 
>> per node of an A64 dual core with cheap RAM
>> and a cheap mainboard could be more interesting than a faster node that's 
>> a little bit more expensive, using DDRII ram.
>>
>> So the aspect of cost could be a concern.
>>
>> At dual socket level however, the choice is simple. Woodcrest will outgun 
>> AMD in a big way.
>>
>
> You are comparing Intel's latest processor (which little or no real 
> benchmarks) to AMD's last generation processor.  The next generation AMD 
> will include an extra SSE2 register.  AMD still scales much better from 1 
> socket to 2 socket (in general of course, your benchmarks may vary) and 
> Intel can't touch the > 2 socket market yet.
>
> (Not an AMD fanboy, just someone who appreciates seeing performance of 
> real codes than arguing performance based on architecture and 
> press-releases.)
>
> Craig
>
>
>> Add to that that the new socket from intel is like 125 watts TDP. That's 
>> just not normal. That's wasting as much as itanium2!
>>
>> Vincent
>>
>> ----- Original Message ----- From: "Kevin Ball" <kball at pathscale.com>
>> To: "Erik Paulson" <epaulson at cs.wisc.edu>
>> Cc: <beowulf at beowulf.org>; "Patrick Geoffray" <patrick at myri.com>
>> Sent: Wednesday, June 28, 2006 10:29 PM
>> Subject: Re: [Beowulf] Three notes from ISC 2006
>>
>>
>>> On Wed, 2006-06-28 at 13:41, Erik Paulson wrote:
>>>> On Wed, Jun 28, 2006 at 04:25:40PM -0400, Patrick Geoffray wrote:
>>>> >
>>>> > I just hope this will be picked up by an academic that can convince
>>>> > vendors to donate. Tax break is usually a good incentive for that :-)
>>>> >
>>>>
>>>> How much care should be given to the selection of the nodes? 
>>>> Performance
>>>> is a function of both the nodes and the interconnect - so while your
>>>> test cluster allows for direct comparisons of the interconnects it's 
>>>> only
>>>> for a cluster of AMD processors, or for Intel processors.
>>>
>>> Prior to Woodcrest, I would have said AMD 100%.  Now?  Its hard to say.
>>> I think AMD nodes will still tend to do better at scaling and show
>>> interconnects in a better light than Intel nodes, but Woodcrest
>>> performance looks like it may be good enough to at least make things
>>> competitive for all but the largest clusters.
>>>
>>>>
>>>> I could imagine there would be academic sites that would host this
>>>> thing, and possibly even spring for the nodes, provided that the
>>>> interconnects were donated and they got to use it when it's not in
>>>> use (and probably had some promise that no more than X% of the time
>>>> would the cluster be in "benchmark" mode)
>>>
>>> This is very possible... especially if the benchmarking results were
>>> interesting enough to pull some papers out of.
>>>
>>> -Kevin
>>>
>>>>
>>>> -Erik, not legally authorized to volunteer the University of Wisconsin 
>>>> to
>>>> host any such thing.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
>
> 



