[Beowulf] More AMD rumors

Mon Nov 19 10:40:41 PST 2012

On Nov 19, 2012, at 7:23 PM, Robert G. Brown wrote:

> On Mon, 19 Nov 2012, Vincent Diepeveen wrote:
>
>>
>> On Nov 19, 2012, at 6:12 PM, Robert G. Brown wrote:
>>
>>> On Mon, 19 Nov 2012, Vincent Diepeveen wrote:
>>>
>>>> If you measure memory latency on all 8 cores at the same time, it's
>>>> even more horrible.
>>>
>>> Thanks for a remarkably clear and useful reply, Vincent.  This nearly
>>> precisely mirrors my own measurements with a more floating point
>>> intensive task.  The larger i7-3770 cache and its 8 operational contexts
>>> (it is a four core system but it maintains two completely independent
>>> contexts per core, IIRC) seem to give it an overwhelming advantage over
>>> the FX with its eight "real" cores but much smaller cache.  Interesting
>>> to see that this continues with the (I assume) integer/logic intensive
>>> chess code.
>>
>> Maybe you meant it correctly but wrote it wrong.
>>
>> The FX8150 has a huge SLOW L2 cache of 1MB or so (2MB a module), and
>> the i7's all have a small FAST L2 cache of around 256KB.
>
> Um, from cat /proc/cpuinfo:
>
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 58
> model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
> stepping        : 9
> cpu MHz         : 1600.000
> cache size      : 8192 KB
> ...

That's the L3 cache. The L3 cache is just SRAM, so to speak.

What I referred to is the big difference in L2 cache.

Here is what to look for:

http://en.wikipedia.org/wiki/Ivy_Bridge_%28microarchitecture%29

CPUID code 0306A9h Product code 80637 (desktop)

L1 cache 64 kB per core
L2 cache 256 kB per core
L3 cache 3 MB to 8 MB shared

The AMD FX-8150 on the other hand has a whopping 1MB of L2 per core.
The L3 cache is 8MB as well.
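
To see the per-level cache sizes on a running Linux box instead of the
single "cache size" line in /proc/cpuinfo, the kernel's sysfs tree can be
read directly. What follows is only a rough sketch: it assumes a Linux
kernel that exposes /sys/devices/system/cpu/cpu0/cache/index*/ with the
attribute files level, type, size and shared_cpu_list, and the function
name read_cache_levels is made up for this example.

```python
from pathlib import Path


def read_cache_levels(cpu_dir="/sys/devices/system/cpu/cpu0/cache"):
    """Return one dict per cache that Linux sysfs lists for a CPU."""
    caches = []
    for index_dir in sorted(Path(cpu_dir).glob("index*")):
        def field(name):
            # Each attribute is a small one-line text file.
            return (index_dir / name).read_text().strip()

        caches.append({
            "level": int(field("level")),
            "type": field("type"),            # Data, Instruction or Unified
            "size": field("size"),            # a string such as "256K"
            "shared_with": field("shared_cpu_list"),
        })
    return caches


if __name__ == "__main__":
    # Prints nothing if the sysfs directory is absent (non-Linux system).
    for c in read_cache_levels():
        print("L%d %-12s %8s  shared by cpus %s"
              % (c["level"], c["type"], c["size"], c["shared_with"]))
```

The shared_cpu_list column is the useful part for this discussion: it shows
which logical CPUs share a given cache, so a per-core L2 and a chip-wide L3
can be told apart.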

Though I am probably asking for the fury of hardware engineers by saying
so: between the i7 and the Bulldozer there is no big difference in the L1
and the L3, but there is a huge difference in the L2.

Both CPUs decode 4 instructions a clock, by the way: AMD per module,
Intel per core.

The real big difference is therefore the latencies of the caches and RAM
and the speed (or slowness) of the execution units.
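
A rough way to look at such latencies is a pointer chase: random,
dependent 8-byte reads through a buffer, so that each read has to finish
before the next address is known. The sketch below is not the test
program mentioned in this thread, only an illustration of the idea. The
names random_cycle and chase and the buffer sizes are assumptions, and in
Python the interpreter overhead is added to every read, so only the
difference between a small and a big buffer says anything; absolute
nanoseconds need a C version.

```python
import random
import time
from array import array


def random_cycle(n, seed=0):
    """Sattolo's algorithm: a permutation of 0..n-1 that is one single cycle."""
    rng = random.Random(seed)
    nxt = array("Q", range(n))          # 8-byte unsigned entries
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)            # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Do `steps` dependent reads; return (ns per read, last index)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]                  # next address depends on this read
    elapsed = time.perf_counter() - start
    return elapsed * 1e9 / steps, idx


if __name__ == "__main__":
    # 32 KB and 16 MB buffers: assumed sizes, pick them around your caches.
    for entries in (1 << 12, 1 << 21):
        ns, _ = chase(random_cycle(entries), 1_000_000)
        print("%9d entries: %.1f ns per read" % (entries, ns))
```

The single cycle matters: a plain random shuffle can fall into a short
loop that fits in cache, which would hide the RAM latency. To load all
cores at once, one copy per core can be started at the same time.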

AMD had a new design trick up their sleeves, btw, to do the decoding
'better': splitting it from 1 bundle of 4 instructions into 2 bundles of
2 instructions each clock. That should speed things up a few percent for
parallel workloads...

Yet you compare with quad-core Intels, which makes little sense IMHO.
More interesting are the six-core and 8-core Intels, if you ask me.

>
> Note well that 8192 KB is 8 MB as of last time I looked.  Although my FX
> box is turned off at the moment because it is loud in ADDITION to being
> hot, I recall it had a 2 MB cache total.  I'll boot it later today and
> look again, but I think I reported all of this on list two weeks ago,
> with graphs.
>
> I don't know if these cache sizes are per core or per system, but note
> that I'm up to "processor 7" on "core 3" according to other lines.  The
> kernel, at least, thinks that the Intel has way more L2 than the AMD.
> Since the only way I can possibly interpret my job getting linear
> speedup on four cores through eight tasks is for the job to more or less
> be executing out of cache so that it was just doing hardware context
> switches into separate unblocked ALUs or the like, I sort of believe it.
>
>>
>> If we measure accurately then the FX8150 gets a huge speedup from the
>> SMT. So moving from 4 cores to 8 cores it benefits really a lot, exactly
>> what you would expect with a slow L2 cache.
>>
>> The i7 on the other hand hardly profits from Hyperthreading. In general,
>> the higher you clock (or overclock) the i7, the more it profits, yet we
>> are still speaking of a small percentage: 20% at lower clock up to 30%
>> at high clock for Diep.
>>
>> For most number crunching floating point code here (prime numbers) the
>> speedup from hyperthreading is more around 5%, so it hardly benefits
>> there.
>>
>> On the more modern i7's the multiplication unit has been sped up, so it
>> can deliver a much bigger throughput there.
>>
>> This whereas on the FX8150 it has been slowed down by a factor of 2.
>>
>>>
>>> Basically, the i7 looks like a butt-kicking good processor, with the one
>>> problem being that it doesn't look like a multiprocessing cpu (at least
>>> I can't find a dual i7 motherboard, although in principle it appears to
>>> be possible), leaving one with Xeons that don't LOOK like they would
>>> perform as well, although I'd be interested in information on that as
>>> well.
>>
>> The i7-3770k is the latest i7 and it's Ivy Bridge.
>> It's really low power though, just around 50 watts.
>>
>> The Xeons are all an older generation of i7, Sandy Bridge. They eat lots
>> of power, yet performance is very good.
>>
>> Intel wants to cash in on them; AMD really messed up in that market
>> segment.
>>
>> For most servers in the server market, not to be confused with HPC,
>> power consumption does matter, and Intel is winning the battle there.
>>
>>>
>>> At the moment, single processor i7's look like they might actually be
>>> the world's fastest, at least on a per core basis.  OTOH, it might well
>>> be that putting two of them on a single board would horribly saturate
>>> the memory bus and cause memory management collisions and worse and
>>> cost them their advantage.
>>
>> In itself AMD's coherency protocol is in some areas superior to Intel's.
>>
>> Intel has been struggling there for a good number of years, which is
>> especially visible in the 4 socket domain, not to mention 8 sockets.
>>
>> Note that newer Xeons have a few features which AMD doesn't have, which
>> in some software might kick butt. That's synchronisation within the L3s,
>> whereas AMD goes via the RAM.
>>
>> I'm not into patents, yet it's possible one reason for the success is
>> that AMD took over DEC Alpha's master-slave concept. I'm not sure whether
>> Intel's problem was getting around those patents.
>>
>> In either case, Intel's latency to the RAM was always lower than AMD's,
>> except when Intel still had the memory controller off die and Opteron
>> was released.
>>
>> AMD then quickly got 50% market share in the server market with Opteron
>> for a short while.
>>
>> I wrote a test program to measure latency to the RAM, doing just random
>> reads of 8 bytes from a big buffer, with all cores at the same time.
>>
>> From memory I recall the following numbers:
>>
>> i7 single chip      : 60 - 70 ns
>> dual i7 Xeon 3.4GHz : 90 ns
>> Phenom DDR3         : 100 ns
>> FX8150              : 160+ ns   (thanks to Joel Hruska for benchmarking)
>>
>> So AMD's decision to design a chip with a latency even worse than their
>> previous generation Phenom core is not explainable for the server
>> market. They did well last time, when their latency to the RAM was
>> BETTER than Intel's. So making it worse there is a weird decision.
>>
>> This is not just the architects' fault. This is something so important
>> to a company like AMD that the CEO must be involved in such decisions.
>>
>> In all server loads this latency issue is a BIG reason why the Bulldozer
>> is so slow: both the L2 latency and the RAM latency.
>>
>> Please note that if you measure with a single core up to 4 cores, the
>> latency on Bulldozer is a lot better. It slows down really a lot when
>> putting all cores under load.
>>
>>>
>>> I'm getting ready to do some very data intensive stuff -- terabyte-scale
>>> datasets being chewed to pieces basically -- to the point where my
>>> "cluster" will probably be a pile of RAIDs each with its own private
>>> copy of the datasets in question and equipped with an i7 motherboard,
>>> which seems odd somehow (as the i7 motherboards aren't generally
>>> configured as "server" motherboards) but the Xeons all run at lower
>>> clock and are older technology.
>>>
>>> Comments from anyone else?
>>>
>>
>> Cheapskate clusters with low clocked CPUs are totally unbeatable
>> pricewise.
>>
>> I don't know whether you can use AVX. If not, did you consider buying,
>> for $150, a bunch of 2 socket Xeon L5420 nodes or something with 8 GB
>> RAM?
>>
>> For the price of a single i7 system you can get 3 to 4 of them.
>>
>> Another idea is using a 48 core AMD system. Though on ebay the CPUs are
>> a tad more expensive now, if you buy 4 of the 6180SE and a motherboard,
>> you have 48 cores, huge RAM, and 4 memory controllers and 6 memory
>> channels a socket.
>>
>> A total of 24 memory channels or so (if I did my math OK).
>>
>> Until recently these 6180SE CPUs were $450 on ebay, though I see them
>> now for $650 or so.
>>
>> If your workload parallelizes well it could be an idea. They do not have
>> AVX however.
>>
>> For what you are going to do maybe your biggest pal is ebay, regardless
>> of what you want to order.
>>
>>
>>>    rgb
>>>
>>>>
>>>>> I would have hoped that AMD would dig in and innovate and
>>>>> regain at least parity if not the lead, because it is good for the
>>>>> industry for Intel to have serious competition, but while Intel could
>>>>> make money and survive as second best to AMD, AMD can't make any money
>>>>> as second best to Intel...
>>>>
>>>> We must of course split the 2 worlds of HPC performance.
>>>> In fact there are 3, but let's do a rough 2 world division:
>>>>
>>>> a) floating point or vectorized performance (can be integers as well)
>>>>
>>>> We skip a): the manycores have won there.
>>>>
>>>> b) integer performance, non-vectorized
>>>>
>>>> For integers and branches I take a huge program like Diep:
>>>>
>>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=13
>>>>
>>>> More is better.
>>>>
>>>> i7-3960X-EE         : 2.0 million chess positions a second (12 logical cores)
>>>> i7-980x turbo       : 1.85 million chess positions a second (12 logical cores)
>>>> i7-3770k            : 1.47 million chess positions a second (8 logical cores)
>>>> AMD Phenom X6 1100T : 1.34 million chess positions a second (6 cores)
>>>> AMD Phenom X6 1090T : 1.30 million chess positions a second (6 cores)
>>>> FX-8150             : 1.22 million chess positions a second (8 mini cores)
>>>>
>>>> The FX-8150 is AMD's latest 'Bulldozer' CPU.
>>>>
>>>> The problem is that the new generation FX-8150, on a NEW process
>>>> technology, with 2 billion transistors or so (caches counted; that is
>>>> the initial press release from AMD, not the later one where by
>>>> creatively not counting things they reached 1.2 billion), is not
>>>> beating their own old design.
>>>>
>>>> Furthermore another big problem is power usage.
>>>>
>>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=6
>>>>
>>>> Under full load:
>>>>
>>>> Phenom X6 1090T : 69.6 watt
>>>> Phenom X6 1100T : 92 watt
>>>>
>>>> We see how the 1100T was already clocked a tad too high by AMD, which
>>>> explains the huge power increase.
>>>>
>>>> Now the FX-8150 : 115.2 watt
>>>>
>>>> As if Moore's Law guaranteeing progress doesn't exist...
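
Putting the two sets of quoted figures together gives a rough efficiency
number. This is only a back-of-the-envelope sketch: it assumes that the
"full load" power figures are representative of the load while running
Diep, which the quoted pages do not state, and the name
positions_per_joule is made up for this example.

```python
# Figures quoted in this thread: (Diep positions per second, watts under full load).
results = {
    "Phenom X6 1090T": (1.30e6, 69.6),
    "Phenom X6 1100T": (1.34e6, 92.0),
    "FX-8150":         (1.22e6, 115.2),
}


def positions_per_joule(nps, watts):
    """Positions per second divided by watts gives positions per joule."""
    return nps / watts


efficiency = {cpu: positions_per_joule(nps, w) for cpu, (nps, w) in results.items()}

if __name__ == "__main__":
    # Most efficient first.
    for cpu, ppj in sorted(efficiency.items(), key=lambda kv: -kv[1]):
        print("%-16s %8.0f positions per joule" % (cpu, ppj))
```

Under that assumption the 1090T comes out at roughly 18,700 positions per
joule, the 1100T at roughly 14,600 and the FX-8150 at roughly 10,600.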
>>>>
>>>> As for you, maybe multiplication was important in many of the
>>>> benchmarks you did. Each minicore has its own multiplication unit.
>>>> Sounds good huh?
>>>>
>>>> So far the good news. The problem is: that unit is also over 2 times
>>>> slower...
>>>>
>>>> Please note that Bulldozer does have AVX. From benchmarks we know that
>>>> both Intel and AMD, with this Bulldozer, have tried to optimize
>>>> performance for games, games using AVX especially.
>>>>
>>>> It's not doing badly there in fact, though worse than the quad-core
>>>> Intels. I don't want a quad-core chip though.
>>>> I want a million cores.
>>>>
>>>>>
>>>>>     rgb
>>>>>
>>>>>>
>>>>>> --
>>>>>> Doug
>>>>>>
>>>>>> _______________________________________________
>>>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>>>>> Computing
>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>>
>>>>>
>>>>> Robert G. Brown                        http://www.phy.duke.edu/~rgb/
>>>>> Duke University Dept. of Physics, Box 90305
>>>>> Durham, N.C. 27708-0305
>>>>> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu