[Beowulf] More AMD rumors

Mon Nov 19 10:23:07 PST 2012

On Mon, 19 Nov 2012, Vincent Diepeveen wrote:

>
> On Nov 19, 2012, at 6:12 PM, Robert G. Brown wrote:
>
>> On Mon, 19 Nov 2012, Vincent Diepeveen wrote:
>>
>>> If you measure memory latency at all 8 cores at the same time, it's
>>> even more horrible.
>>
>> Thanks for a remarkably clear and useful reply, Vincent.  This nearly
>> precisely mirrors my own measurements with a more floating point
>> intensive task.  The larger i7-3770 cache and its 8 operational
>> contexts (it is a four core system but it maintains two completely
>> independent contexts per core, IIRC) seem to give it an overwhelming
>> advantage over the FX with its eight "real" cores but much smaller
>> cache.  Interesting to see that this continues with the (I assume)
>> integer/logic intensive chess code.
>
> Maybe you meant it correctly but wrote it wrong.
>
> The FX8150 has a huge SLOW L2 cache of 1MB or so (2MB a module) and
> the i7s all have a small FAST L2 cache of around 256KB.

Um, from cat /proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping        : 9
cpu MHz         : 1600.000
cache size      : 8192 KB
...

Note well that 8192 KB is 8 MB as of last time I looked.  Although my FX
box is turned off at the moment because it is loud in ADDITION to being
hot, I recall it had a 2 MB cache total.  I'll boot it later today and
look again, but I think I reported all of this on list two weeks ago,
with graphs.

I don't know if these cache sizes are per core or per system, but note
that I'm up to "processor 7" on "core 3" according to other lines.  The
kernel, at least, reports way more cache for the Intel than for the AMD
(cpuinfo does not say which cache level that figure refers to).
Since the only way I can possibly interpret my job getting linear
speedup on four cores through eight tasks is for the job to more or less
be executing out of cache so that it was just doing hardware context
switches into separate unblocked ALUs or the like, I sort of believe it.
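
For what it's worth, the single "cache size" line in /proc/cpuinfo does
not say which cache level it describes or how many logical CPUs share
it.  A minimal Python sketch, assuming a Linux kernel that exposes the
usual /sys/devices/system/cpu/cpuN/cache/indexM files (level, type,
size, shared_cpu_list), would settle the per-core versus per-system
question.  The function name read_cache_levels is just my own label:

```python
from pathlib import Path


def read_cache_levels(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return one dict per cache visible to the given logical CPU.

    Reads the Linux sysfs cache directories (cpuN/cache/indexM).  Each
    entry holds the cache level, its type, the size string exactly as
    the kernel prints it, and the logical CPUs that share the cache.
    """
    cache_dir = Path(sysfs_root) / f"cpu{cpu}" / "cache"
    caches = []
    for index in sorted(cache_dir.glob("index[0-9]*")):
        def field(name):
            # Missing attribute files are reported as "?" instead of failing.
            path = index / name
            return path.read_text().strip() if path.exists() else "?"
        caches.append({
            "level": field("level"),
            "type": field("type"),
            "size": field("size"),
            "shared_with": field("shared_cpu_list"),
        })
    return caches


if __name__ == "__main__":
    for c in read_cache_levels(cpu=0):
        print(f"L{c['level']} {c['type']:<12} {c['size']:>8}  "
              f"shared by CPUs {c['shared_with']}")
```

Run on both boxes, a cache listed as shared by every logical CPU would
be per chip, while one shared only by two sibling CPUs would be per
core (or per module on the FX).
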

>
> If we measure accurately then the FX8150 gets a huge speedup from the
> SMT.  Moving from 4 cores to 8 cores, it benefits a lot.  Exactly
> what you would expect with a slow L2 cache.
>
> The i7 on the other hand hardly profits from Hyperthreading.  In
> general, the higher you clock (or overclock) the i7, the more it
> profits, yet we are still speaking about a small percentage: 20% at
> lower clocks up to 30% at high clocks for Diep.
>
> For most number crunching floating point code here (prime numbers) the
> speedup from Hyperthreading is more like 5%, so it hardly benefits
> there.
>
> On the more modern i7s the multiplication unit has been sped up, so
> it can deliver a much bigger throughput there.
>
> On the FX8150, by contrast, it has been slowed down by a factor of 2.
>
>>
>> Basically, the i7 looks like a butt-kicking good processor, with the
>> one problem being that it doesn't look like a multiprocessing cpu (at
>> least I can't find a dual i7 motherboard, although in principle it
>> appears to be possible), leaving one with Xeons that don't LOOK like
>> they would perform as well, although I'd be interested in information
>> on that as well.
>
> The i7-3770k is the latest i7 and it's Ivy Bridge.
> It's really low power though, just around 50 watts.
>
> The Xeons are all the older i7 generation, Sandy Bridge.  They eat
> lots of power, yet performance is very good.
>
> Intel wants to cash in on them; AMD really messed up in that market
> segment.
>
> For most servers in the server market, not to be confused with HPC,
> power consumption does matter and Intel is winning the battle there.
>
>>
>> At the moment, single processor i7's look like they might actually be
>> the world's fastest, at least on a per core basis.  OTOH, it might
>> well be that putting two of them on a single board would horribly
>> saturate the memory bus and cause memory management collisions and
>> worse and cost them their advantage.
>
> In itself AMD's coherency protocol is in some areas superior to
> Intel's.
>
> Intel has struggled there for many years, which is especially visible
> in the 4 socket domain, not to mention 8 sockets.
>
> Note that newer Xeons have a few features which AMD doesn't have,
> which in some software might kick butt.  That's synchronisation within
> the L3s, whereas AMD goes via the RAM.
>
> I'm not into patents, yet it's possible one reason for the success is
> that AMD took over DEC Alpha's master slave concept.  I'm not sure
> whether Intel's problem was getting around those patents.
>
> In either case, Intel's latency to the RAM was always better than
> AMD's, except for when Intel was still off die with the RAM and
> Opteron was released.
>
> AMD then quickly got 50% market share in the server market with
> Opteron for a short while.
>
> I wrote a test program to measure latency to the RAM doing just random
> reads of 8 bytes from a big buffer, with all cores at the same time.
>
> From memory, I recall the following numbers:
>
> i7 single chip       : 60 - 70 ns
> dual i7 Xeon 3.4 GHz : 90 ns
> Phenom DDR3          : 100 ns
> FX8150               : 160+ ns   (thanks to Joel Hruska for benchmarking)
>
> So AMD's design idea now, to design a chip with a latency even worse
> than their previous generation Phenom core, is not explainable for the
> server market.  They did well last time, when their latency to the RAM
> was BETTER than Intel's.  So making it worse there is a weird decision.
>
> This is not just the architects' fault.  This is something so
> important to a company like AMD that the CEO must be involved in such
> decisions.
>
> In all server loads this latency issue of the Bulldozer is a BIG
> reason why it is so slow, both the L2 latency and the RAM latency.
>
> Please note that if you measure with a single core up to 4 cores, the
> latency of Bulldozer is a lot better.  It gets a lot worse when
> putting all cores under load.
>
>>
>> I'm getting ready to do some very data intensive stuff --
>> terabyte-scale datasets being chewed to pieces basically -- to the
>> point where my "cluster" will probably be a pile of RAIDs each with
>> its own private copy of the datasets in question and equipped with an
>> i7 motherboard, which seems odd somehow (as the i7 motherboards aren't
>> generally configured as "server" motherboards) but the Xeons all run
>> at lower clock and are older technology.
>>
>> Comments from anyone else?
>>
>
> Cheapskate clusters with low-clocked CPUs are totally unbeatable
> pricewise.
>
> I don't know whether you can use AVX.  If not, did you consider buying
> a bunch of $150 nodes: 2 socket Xeon L5420 or something, with 8 GB
> RAM?
>
> For the price of a single i7 system you can get 3 to 4 of them.
>
> Another idea is using a 48 core AMD system.  The CPUs are a tad more
> expensive on eBay now, but if you buy 4 of the 6180SE and a
> motherboard, you have 48 cores, huge RAM, and 4 memory controllers and
> 6 memory channels a socket.
>
> A total of 24 memory channels or so (if I did my math OK).
>
> Until recently these 6180SE CPUs were $450 on eBay, though I see
> them now for $650 or so.
>
> If your workload parallelizes well it could be an idea.  They do not
> have AVX however.
>
> For what you are going to do, maybe your biggest pal is eBay,
> regardless of what you want to order.
>
>
>>    rgb
>>
>>>
>>>> I would have hoped that AMD would dig in and innovate and
>>>> regain at least parity if not the lead, because it is good for the
>>>> industry for Intel to have serious competition, but while Intel
>>>> could make money and survive as second best to AMD, AMD can't make
>>>> any money as second best to Intel...
>>>
>>> We must of course split the 2 worlds of HPC performance.
>>> In fact there are 3, but let's do a rough 2 world division:
>>>
>>> a) floating point or vectorized performance (can be integers as well)
>>>
>>> We skip a): the manycores have won there.
>>>
>>> b) integer performance, non-vectorized
>>>
>>> For integers and branches I take a huge program like Diep.
>>>
>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=13
>>>
>>> More is better.
>>>
>>> i7-3960X-EE         : 2.0 million chess positions a second  (12 logical cores)
>>> i7-980x turbo       : 1.85 million chess positions a second (12 logical cores)
>>> i7-3770k            : 1.47 million chess positions a second (8 logical cores)
>>> AMD Phenom X6 1100T : 1.34 million chess positions a second (6 cores)
>>> AMD Phenom X6 1090T : 1.30 million chess positions a second (6 cores)
>>> FX-8150             : 1.22 million chess positions a second (8 mini cores)
>>>
>>> The FX-8150 is AMD's latest 'bulldozer' CPU.
>>>
>>> The problem is that the new generation FX-8150, at a NEW process
>>> technology and with 2 billion transistors or so (caches counted;
>>> that is the initial press release from AMD, not the later one where
>>> they reached 1.2 billion by creatively not counting things), is not
>>> beating their own old design.
>>>
>>> Furthermore another big problem is power usage.
>>>
>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=6
>>>
>>> Under full load:
>>>
>>> Phenom X6 1090T : 69.6 watt
>>> Phenom X6 1100T : 92 watt
>>>
>>> We see how the 1100T already was clocked a tad too high by AMD, which
>>> explains the huge power increase.
>>>
>>> Now the FX-8150 : 115.2 watt
>>>
>>> As if Moore's Law guaranteeing progress doesn't exist...
>>>
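
The two quoted tables can be combined into a rough efficiency figure.
This is only a sketch: it assumes the full-load wattage was measured
under a load comparable to the Diep run, which the quoted text does not
say, and the names below (positions_per_joule and the two dictionaries)
are my own, not anything from the review:

```python
# Figures copied from the two lostcircuits tables quoted in this thread.
# ASSUMPTION: the full-load wattage applies to a Diep-like load; the
# quoted text does not state which workload produced the power numbers.
DIEP_POSITIONS_PER_SEC = {
    "Phenom X6 1090T": 1.30e6,
    "Phenom X6 1100T": 1.34e6,
    "FX-8150": 1.22e6,
}
FULL_LOAD_WATTS = {
    "Phenom X6 1090T": 69.6,
    "Phenom X6 1100T": 92.0,
    "FX-8150": 115.2,
}


def positions_per_joule(positions_per_sec, watts):
    """Throughput divided by power: (positions/s) / (J/s) = positions/J."""
    return {cpu: positions_per_sec[cpu] / watts[cpu]
            for cpu in positions_per_sec}


if __name__ == "__main__":
    efficiency = positions_per_joule(DIEP_POSITIONS_PER_SEC, FULL_LOAD_WATTS)
    # Print the most efficient chip first.
    for cpu, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
        print(f"{cpu:<16} {value:8.0f} positions per joule")
```

On those assumptions the FX-8150 comes out at roughly 10.6 thousand
positions per joule, against about 14.6 thousand for the 1100T and
about 18.7 thousand for the 1090T.
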
>>> As for you, maybe multiplication was important in many of the
>>> benchmarks you did.  Each minicore has its own multiplication unit.
>>> Sounds good, huh?
>>>
>>> So far the good news.  The problem is that the unit is also over 2
>>> times slower...
>>>
>>> Please note that Bulldozer does have AVX.  From benchmarks we know
>>> that both Intel and AMD, with this Bulldozer, have tried to optimize
>>> performance for games, especially games using AVX.
>>>
>>> It's not doing badly there in fact, though worse than the quad-core
>>> Intels.  I don't want a quad-core chip though.
>>> I want a million cores.
>>>
>>>>
>>>>     rgb
>>>>
>>>>>
>>>>> --
>>>>> Doug
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>>>> Computing
>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>
>>>>
>>>> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>>>> Duke University Dept. of Physics, Box 90305
>>>> Durham, N.C. 27708-0305
>>>> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu