[Beowulf] SPEC CPU 2006 released

Sun Sep 3 10:58:59 PDT 2006

Vincent,

I think that we have two different audiences in this discussion and you 
are addressing one of them. The programming side of SPEC and hardware 
benchmarks is interesting. To me it's interesting theoretically. What I 
care about is how the majority of my applications perform on a given set 
of hardware, with a given compiler. My applications range from 
embarrassingly parallel to specifically unparallel.

For the past 6 years, I have used GAMESS and Gaussian with a set of 
input files for my benchmarks. At this point I have results that include 
MIPS R8K, R10K, R14K, PII, PIII, PIV, Xeon, Athlon, Opteron, UltraSparc 
I, UltraSparc II, UltraSparc III, UltraSparc IV, and UltraSparc IV+.

So while I agree with your conclusions for the subgroup that you are 
addressing, it is not the only subgroup within the discussion.

Mike Davis

Vincent Diepeveen wrote:

>
> ----- Original Message ----- From: "Robert G. Brown" <rgb at phy.duke.edu>
> To: "Vincent Diepeveen" <diep at xs4all.nl>
> Cc: "Geoff Jacobs" <gdjacobs at gmail.com>; <beowulf at beowulf.org>
> Sent: Saturday, August 26, 2006 1:43 PM
> Subject: Re: [Beowulf] SPEC CPU 2006 released
>
>
>> On Sat, 26 Aug 2006, Vincent Diepeveen wrote:
>>
>>> Find me 1 site that 'tests' hardware that's objective. Spec is the 
>>> best compromise.
>>
>>
>> I don't know about site testing hardware, but there are many objective
>> hardware tests on sites.  In particular, lmbench is an excellent and
>> unbiased toolset (in my personal belief, having communicated fairly
>> extensively with Larry and Carl, knowing e.g. that Linus uses lmbench
>> routinely to test and tune the linux kernel).  benchmaster (my own
>
>
> lmbench is a great achievement, let's be clear there.
> I don't want to spit at it at all.
>
> lmbench is totally useless for 99% of the programmers who program for
> speed (of course the group of programmers who need speed is a small
> subgroup of the total world; we are only talking here about that
> group, the one that needs performance, as you will see when you
> scroll down).
>
> The latencies it reports are simply not the latencies THEY get,
> because the worst cases of memory access work differently. The worst
> case is always a random lookup in memory, and the majority of software
> needs random lookups. You'll argue that on paper certain software can
> be rewritten not to use them.
>
> That's just not reality.
>
> Basically what 99% need is a good random latency, when randomly
> reading from memory with all the CPUs busy at the same time.
>
> Both AMD and Intel engineers luckily understand this a lot better
> than some simplistic lmbench runs show here.
>
> What has improved in the latest incarnations of AMD and Intel is the
> random lookup latency, for work that typically does NOT fit in L2
> cache the way specint2000 did. For a short while specint2006 is good.
> Sjeng for example uses 150MB RAM, whereas in specint2000 crafty was
> lobotomized to use 2MB RAM.
>
> You see that now suddenly P4 is 2 times slower than AMD64. That didn't
> use to be the case. So even studying how the benchmarks improve shows
> the reality there.
>
> lmbench has historically had it wrong there already. I remember this
> problem repeating over the past 10 years.
>
> What has the lower latency at this moment: a DDR RAM Opteron (160-200
> ns latency for random memory lookups; in fact I look up 8 bytes in my
> own test) or a Woodcrest system with DDR2 (100-140 ns latency for
> random lookups through entire memory)?
>
> LMBENCH will just fool you there.
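
The kind of random-lookup latency test described above can be sketched in C as a pointer chase. This is not Vincent's program, which is not shown in the thread; it is a minimal single-threaded sketch, and the names xorshift64, build_random_cycle, chase and ns_per_lookup are made up for illustration. It links all slots of a table into one random cycle, so that every load depends on the previous one and neither the cache nor the prefetcher can guess the next address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift generator, so the sketch does not depend on RAND_MAX. */
uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Link all n slots into a single random cycle (Sattolo's algorithm):
   following next[] from any slot visits every slot once before it
   returns to the start. The modulo introduces a tiny bias that does
   not matter for a latency test. */
void build_random_cycle(size_t *next, size_t n, uint64_t seed) {
    assert(n >= 2 && seed != 0);
    for (size_t k = 0; k < n; k++) next[k] = k;
    for (size_t k = n - 1; k > 0; k--) {
        size_t j = (size_t)(xorshift64(&seed) % k);  /* 0 <= j < k */
        size_t tmp = next[k];
        next[k] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final slot. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t s = 0; s < steps; s++) p = next[p];
    return p;
}

/* Average CPU time in nanoseconds per dependent load over a table of
   n slots. Single-threaded: the "all CPUs busy" condition from the
   post is NOT reproduced here. */
double ns_per_lookup(size_t n, size_t steps, uint64_t seed) {
    assert(steps > 0);
    size_t *next = malloc(n * sizeof *next);
    assert(next != NULL);
    build_random_cycle(next, n, seed);
    clock_t c0 = clock();
    volatile size_t sink = chase(next, 0, steps);  /* keep the loop alive */
    clock_t c1 = clock();
    (void)sink;
    free(next);
    return (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

To measure main memory rather than cache, n would have to be far larger than the last-level cache, for example ns_per_lookup((size_t)1 << 25, 50000000, 12345) for a 256 MB table on a 64-bit machine. Running one copy per core at the same time would come closer to the situation the post describes.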
>
> Thanks to Bill here for testing my program on one of his Woodcrest
> systems.
>
> To slowly move to your next subject:
>
> This really is the time of the brilliant programmer. Most
> universities that need a lot of crunching speed do not realize it yet,
> but should. If you want to do a simplistic modulo of a tiny prime
> using CMOVs on the processor, my code went like this. This is already
> quite a fast way to do it, not used by most implementations:
>
>      if( i >= prime ) // 1x
>        i -= prime;
>      mask = i-3;
>      i = (i<<1)|1;
>      if( i >= prime ) // 2x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 3x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 4x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 5x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 6x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 7x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 8x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 9x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>      if( i >= prime ) // 10x
>        i -= prime;
>      mask |= (i-3);
>      i = (i<<1)|1;
>     if( mask == 0xffffffff ) { ... verification of mask
>
> So now we took up to 10 modulos of 1 prime. However the problem with
> all such code is that it is totally sequential.
> I managed, however, to get the program 50% faster than the above code.
>
> How?
>
>    if( i >= prime ) // 1x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask = i-3;
>    mask2 = i2-3;
>    i  = (i<<1)|1;
>    i2 = (i2<<1)|1;
>
>    if( i >= prime ) // 2x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 3x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 4x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 5x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 6x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 7x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 8x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 9x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( i >= prime ) // 10x
>      i -= prime;
>    if( i2 >= prime2 )
>      i2 -= prime2;
>    mask  |= i-3;
>    mask2 |= i2-3;
>    i      = (i<<1)|1;
>    i2     = (i2<<1)|1;
>
>    if( mask == 0xffffffff ) { ...verification for prime
>    ...
>    }
>
>    if( mask2 == 0xffffffff ) { ...verification for prime2
>    ...
>    }
>
> Now I basically give the processor a program with a higher ILP
> (instruction-level parallelism; most people writing on this list
> already realize that the majority of its readers, and all the people
> who find this list via Google, do not know what the abbreviations
> stand for), so it has a chance to reschedule code internally and
> execute effectively at a higher IPC (instructions per cycle).
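
The difference between the two listings can be condensed into loop form. This is a sketch and not the posted code: the struct chain_t and the functions step, run_one and run_two are names invented here, and STEPS = 10 simply mirrors the ten unrolled steps. Each step depends on the previous value of i, so a single chain gives the processor nothing to overlap. Two chains for two different primes are independent of each other, so an out-of-order core is free to work on both at once. Whether the compiler turns the conditional subtract into a CMOV, and how much faster the second form runs, depends on compiler and CPU.

```c
#include <assert.h>
#include <stdint.h>

#define STEPS 10  /* mirrors the ten unrolled steps in the post */

typedef struct {
    uint32_t i;     /* running value for one prime */
    uint32_t mask;  /* accumulated OR of (i - 3), start it at 0 */
} chain_t;

/* One step exactly as posted: conditional subtract, accumulate, shift. */
void step(chain_t *c, uint32_t prime) {
    if (c->i >= prime)          /* candidate for a CMOV */
        c->i -= prime;
    c->mask |= c->i - 3;
    c->i = (c->i << 1) | 1;
}

/* Single dependency chain: every step waits for the one before it. */
chain_t run_one(uint32_t i, uint32_t prime) {
    chain_t c = { i, 0 };
    assert(prime > 0);
    for (int s = 0; s < STEPS; s++)
        step(&c, prime);
    return c;
}

/* Two independent chains interleaved: a never reads b and b never
   reads a, so their steps can execute in parallel. */
void run_two(chain_t *a, uint32_t prime, chain_t *b, uint32_t prime2) {
    assert(prime > 0 && prime2 > 0);
    for (int s = 0; s < STEPS; s++) {
        step(a, prime);
        step(b, prime2);
    }
}
```

Both forms compute the same values. run_two(&a, 7, &b, 11) with a = {5, 0} and b = {9, 0} leaves a equal to run_one(5, 7) and b equal to run_one(9, 11). Only the amount of independent work visible to the processor changes.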
>
> Effective speedup on K8 is nearly exactly 50% for this first brute
> force layer of factorisation by sieving.
>
> Such toying is more difficult on K7. The same code doesn't run 50%
> faster on K7; it runs just a small % faster there.
>
> I compared with 32-bit code, so the comparison of K7 vs K8 is very
> fair, though in fact my sieve is basically doing layer 1
> factorisation with brute force sieving by primes up to 64 bits.
>
> The real cause of that 50% speedup, whether K8 has for example better
> rescheduling, or whether K7 has only a single execution unit that can
> do CMOV and K8 has 2, or whether it takes the branches faster, is not
> really clear to me. AMD for sure isn't going to comment on that.
>
> Of course I'll try with 3 primes simultaneously as well. But a 50%
> speedup kicks major butt.
>
> However, just getting the idea of doing an optimization like this: has
> any non-commercial researcher ever done it?
> I bet that out of a total of tens of thousands you can count the
> number of programmers capable of it and carrying it out on 1 hand.
>
> The majority of highend software simply needs highend hardware
> because the persons in question are too lazy to give the software to a
> good programmer to optimize it, or don't want to pay someone for it,
> while in the meantime their government spends millions on hardware
> that runs their software a factor 50 slower and compensates for that
> factor 50 slowdown by throwing a factor 50 more CPUs into battle.
>
> The best optimized software typically runs on a single CPU in
> commercial interfaces, and the worst optimized software usually runs
> on a couple of hundred nodes.
>
> Of course for those reading this posting, not YOUR software :)
>
>> microbenchmark suite, ex. cpu_rate) isn't finished, really, but it is
>> definitely objective.  netpipe is quite objective.  I've never heard
>> bonnie accused of being biased.  Stream is objective, and damn simple
>> code at that.  I really dunno about the new code being developed for
>> top500 testing, but I do trust to SOME extent the folks developing it --
>> the bias if any will be in that it is focussed on HPC and perhaps
>> certain classes of code WITHIN HPC.
>
>
> Well, here is the big problem: what do you want to test with it?
>
> Like IMHO one important feature of highend is the effective latency
> from node to node when the entire network is running the same program,
> even more than the effective bandwidth you have from node to node when
> the entire machine is functioning.
>
> A real problem is that totally embarrassingly parallel software is the
> easiest to donate.
>
> As soon as latency is a problem, it helps so much to optimize for the
> hardware architecture that a testset could be called biased.
>
> Like the parallelism I use in Diep is not biased (it's not taking
> advantage of figuring out which processes are fewer hops away than
> others), whereas I easily could have done that, but that would only
> work for SGI in that case.
>
> Yet the program profits a lot from such things.
>
> On the other hand, I'm busy with a small attempt to see whether it's
> possible to find large primes with numbers other than Mersenne
> numbers. For that I'm busy porting some FFT code in my spare time in
> order to run it in parallel.
>
> I'll have no option but to do that over a cluster with highend cards
> having a good one-way pingpong latency, because a single core just
> can't deliver enough calculation power in the coming 10 years.
>
> So the parallelisation will be in 2 layers. The first is over the
> number of cores that share the same memory in a rather fast manner;
> what I'm still studying is how to parallelize such a thing over a
> low-latency network.
>
> This is rather interesting for a benchmark as we plan to opensource 
> the code.
>
> However we soon figured out that calculating in integers is way
> faster for FFT than calculating with floating point.
>
> To give concrete examples:
>  a) in floating point you have at most 53 bits of significance in a
>      64-bit double precision floating point number
>  b) a multiplication in floating point gives 53 x 53 bits = 53 bits.
>      That means you effectively lose 53 bits of significance. Just 26
>      bits of the original 53 bits can be used. Whereas with 64 x 64
>      bit integers you get a 128-bit result, which means you can use
>      all 64 bits effectively. An imul (reg x reg) is at least as fast
>      as, if not normally faster than, a floating point multiply.
>  c) simple instructions like moving a value from 1 register to another
>      in the normal GPRs (general purpose registers) can be executed at
>      3 instructions a cycle on K8 and 4 on Core2. With MMX/SSE/SSE2
>      this is roughly 4 cycles.
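
Points a) and b) can be checked with a few lines of C. This is a sketch under stated assumptions: unsigned __int128 is a GCC/Clang extension on 64-bit targets, not ISO C, and the function names mul_64x64 and double_product_is_exact are made up here. The first function keeps the full 128-bit product of two 64-bit integers. The second tests whether a product computed in double precision still equals the true integer product, which holds only while the result fits into the 53-bit significand.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64 x 64 -> 128 bit product, returned as high and low halves.
   unsigned __int128 is a compiler extension (GCC, Clang). */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}

/* Returns 1 if multiplying a and b as doubles gives the exact integer
   product, 0 if low-order bits were rounded away. Operands must be
   below 2^53 so that converting them to double is itself exact. */
int double_product_is_exact(uint64_t a, uint64_t b) {
    assert(a < (1ULL << 53) && b < (1ULL << 53));
    double d = (double)a * (double)b;
    unsigned __int128 exact = (unsigned __int128)a * b;
    return (unsigned __int128)d == exact;
}
```

With two 26-bit operands such as (1ULL << 26) - 1 the double product is still exact. With two 28-bit operands such as (1ULL << 27) + 1 the true product needs 55 bits and the double result is rounded, while mul_64x64 keeps every bit even for two full 64-bit inputs.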
>
> The above will hold even more true in the future, because the tiny
> processors are winning bigtime. A dinosaur like Itanium2 of course
> can't compete with a K8 or Core2. Just the production price of such a
> giant chip as Itanium2 is of course factors higher, as its sheer size
> is factors bigger than the tiny sizes of an A64 dual core (183mm^2) or
> a Core2 dual core (141mm^2).
>
> Meanwhile the factory is doubling in price to build, so you simply
> can't use a dedicated factory for a highend processor, as the revenue
> out of a highend processor is less than 1/10th of the factory cost. So
> production of a highend giant dinosaur processor will always be done
> on outdated process technologies. Itanium2 will be using 0.09 micron
> when every PC processor is already on 0.065 micron technology.
>
> So by Apollo as the Greeks would say, please let it be an integer 
> benchmark this time. It's all going to be pc processors in those
> supercomputers anyway.
>
> Many FFT codes, yes, even when calculating at 10^-19, can be done a
> lot faster with integer code.
> Just no one got paid so far to do it.
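
One well-known integer counterpart of the FFT is the number-theoretic transform (NTT), which replaces complex roots of unity by roots of unity modulo a prime. Nothing in the post says that this is the transform Vincent's code uses; the sketch below only illustrates that an FFT-style convolution can be done exactly in integers. The modulus 998244353 = 119 * 2^23 + 1 with primitive root 3 is a common textbook choice, and the function names are made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOD 998244353u  /* 119 * 2^23 + 1, prime; 3 is a primitive root */

/* base^exp mod MOD by square-and-multiply. */
uint32_t mod_pow(uint32_t base, uint32_t exp) {
    uint64_t result = 1, b = base % MOD;
    while (exp) {
        if (exp & 1) result = result * b % MOD;
        b = b * b % MOD;
        exp >>= 1;
    }
    return (uint32_t)result;
}

/* In-place radix-2 transform of a[0..n-1], all values below MOD.
   n must be a power of two, at most 2^23. A nonzero `invert` computes
   the inverse transform, including the division by n. */
void ntt(uint32_t *a, size_t n, int invert) {
    assert(n >= 1 && (n & (n - 1)) == 0 && n <= ((size_t)1 << 23));
    /* Bit-reversal permutation. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint32_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    /* Butterflies, with a root of unity of order len at each level. */
    for (size_t len = 2; len <= n; len <<= 1) {
        uint32_t w_len = mod_pow(3, (MOD - 1) / (uint32_t)len);
        if (invert) w_len = mod_pow(w_len, MOD - 2);  /* modular inverse */
        for (size_t i = 0; i < n; i += len) {
            uint64_t w = 1;
            for (size_t k = 0; k < len / 2; k++) {
                uint32_t u = a[i + k];
                uint32_t v = (uint32_t)(a[i + k + len / 2] * w % MOD);
                a[i + k] = (u + v) % MOD;
                a[i + k + len / 2] = (u + MOD - v) % MOD;
                w = w * w_len % MOD;
            }
        }
    }
    if (invert) {
        uint64_t n_inv = mod_pow((uint32_t)n, MOD - 2);
        for (size_t i = 0; i < n; i++)
            a[i] = (uint32_t)(a[i] * n_inv % MOD);
    }
}

/* Exact cyclic convolution of a and b modulo MOD; the result is left
   in a, and b is overwritten with its transform. */
void cyclic_convolve(uint32_t *a, uint32_t *b, size_t n) {
    ntt(a, n, 0);
    ntt(b, n, 0);
    for (size_t i = 0; i < n; i++)
        a[i] = (uint32_t)((uint64_t)a[i] * b[i] % MOD);
    ntt(a, n, 1);
}
```

As a usage example, convolving {1, 2, 3, 0} with {4, 5, 0, 0} at n = 4 multiplies the polynomials 1 + 2x + 3x^2 and 4 + 5x and leaves {4, 13, 22, 15} in the first array, with no rounding error to manage. A real large-number multiply would need a bigger modulus or several primes combined, which this sketch does not attempt.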
>
> Measuring number of flops is totally useless with tiny pc processors.
> Number of integer operations a second is far more useful. Such codes 
> can run factor 6 faster.
>
>> Macro benchmarks are much tougher, as they are really in the category of
>> "is this program "like" my program" for the component(s) of the suite or
>> tool.  They also tend to be relatively "rich" code environments, where
>> compilers are tested as much or more than just the hardware.  This is
>> always true, I suppose, but it makes it difficult to separate
>> comparative analysis of different hardware from the compiler even when
>> the same compiler is used, especially across architectures.  Leaving
>> aside the issue of whether or not one SHOULD try to differentiate
>> compiler from the hardware.  Microbenchmarks are probably better for
>> that purpose as in many cases the core code fragments they time are
>> small enough that there isn't that much variation in their assembler
>> implementation, at any rate.
>>
>> I am a longstanding, fairly passionate advocate of totally open (GPL)
>> benchmarking code.  I was one of the folks who talked John into opening
>
>
> Sure, but who pays good programmers? A good programmer makes 100x more
> difference. It's really hard to find good codes to benchmark, because
> a compiler team can totally screw you and show results that in reality
> just aren't the case.
>
> I remember when the P4 Xeon 1.7GHz and K7 1.2GHz MP were released. For
> every single person on planet earth the K7 was always faster than a P4.
>
> True, the memory subsystem of K7 was totally screwed, SO YOU PROGRAM
> AROUND THAT.
>
> Somehow all kinds of P4s managed to land as pretty good in benchmarks.
> This is because Intel shipped those testers P4EE type chips with a 2
> cycle or 3 cycle L1, and in reality sold a majority of P4 Prescotts
> with a 4 cycle L1.
>
> History will repeat itself there always.
>
> I remember many hardware homepages that asked for Diep, only to not
> use it in a test in the end, because they wanted to promote a certain
> new processor (be it AMD or Intel or some other manufacturer) and my
> software simply didn't run fast on it. Selective testing.
>
> That'll happen once again here of course.
>
>> up lmbench and liberalizing its fairly restrictive original license,
>> pointing out that the synergy obtained by freeing the code was of
>> greater importance than the control he was trying to maintain to prevent
>> vendor abuse.  In the end, I don't think anyone has abused lmbench in
>> part because it has never become a major marketing tool (unlike certain
>> other benchmark suites I can name).
>>
>> One of the things that has LONG irritated me is that SPEC isn't open
>> source, isn't extensible, isn't freely available.  I'm certain that
>> there are reasons for this -- they could even be "good" reasons.  That
>> doesn't mean I have to like it, or think that the world doesn't need a
>> truly open alternative.  However, this needs at least one human to "own"
>> the project of creating the alternative.  I myself cannot do it -- I
>> already own a languishing microbenchmark suite, a cluster monitoring
>> toolset, a random number tester, and an XML-based flashcard presentation
>> program, and none of them get enough attention as it is.  If somebody
>> DID put one together and own it, though, I'd contribute, help, clap
>> loudly, cheer.
>>    rgb
>
>
> Hah, when running XML parsers you have to pay Microsoft first, as they
> claim to have some rights there :)
>
> Vincent
>