[Beowulf] Nehalem and Shanghai code performance for our rzf example
Vincent Diepeveen
diep at xs4all.nl
Fri Jan 16 09:39:32 PST 2009
Note that single-threaded performance doesn't say a thing,
because when just 1 core is running, Nehalem automatically overclocks that core.
A very nasty feature.
My experience is that Shanghai scales to nearly 4.0 versus 3.2 for Nehalem,
because of that overclocking of 1 core.
So seeing a 9% higher IPC is not very strange.
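
The scaling argument can be put into numbers with a minimal sketch. The function names are hypothetical, the core count of 4 is an assumption, and the boosted single-core clock passed in the usage note is a placeholder, not a measured value:

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the number of cores. */
double efficiency(double speedup, int cores)
{
    assert(cores > 0);
    return speedup / (double)cores;
}

/*
 * Rescale a speedup whose 1-core baseline ran at a boosted clock.
 * one_core_ghz: clock while a single core is busy (assumed, not measured)
 * all_core_ghz: clock while all cores are busy
 */
double clock_corrected_speedup(double speedup, double one_core_ghz,
                               double all_core_ghz)
{
    assert(all_core_ghz > 0.0);
    return speedup * one_core_ghz / all_core_ghz;
}
```

With 4 cores assumed, efficiency(4.0, 4) is 1.0 and efficiency(3.2, 4) is 0.8. A call such as clock_corrected_speedup(3.2, 3.46, 3.2), where 3.46 is only a placeholder, shows how a faster single-core baseline makes the multi-core scaling look worse than it is per clock.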
Thanks,
Vincent
On Jan 16, 2009, at 3:25 PM, Joe Landman wrote:
> Hi folks:
>
> Thought you might like to see this. I rewrote the interior loop
> for our Riemann Zeta Function (rzf) example for SSE2, and ran it on
> a Nehalem and on a Shanghai. This code is compute intensive. The
> inner loop which had been written as this (some small hand
> optimization, loop unrolling, etc):
>
> l[0]=(double)(inf-1 - 0);
> l[1]=(double)(inf-1 - 1);
> l[2]=(double)(inf-1 - 2);
> l[3]=(double)(inf-1 - 3);
> p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
> for(k=start_index;k>end_index;k-=unroll)
>    {
>      d_pow[0] = l[0];
>      d_pow[1] = l[1];
>      d_pow[2] = l[2];
>      d_pow[3] = l[3];
>
>      for (m=n;m>1;m--)
>         {
>           d_pow[0] *= l[0];
>           d_pow[1] *= l[1];
>           d_pow[2] *= l[2];
>           d_pow[3] *= l[3];
>         }
>      p_sum[0] += one/d_pow[0];
>      p_sum[1] += one/d_pow[1];
>      p_sum[2] += one/d_pow[2];
>      p_sum[3] += one/d_pow[3];
>
>      l[0]-=four;
>      l[1]-=four;
>      l[2]-=four;
>      l[3]-=four;
>    }
> sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;
>
> has been rewritten as
>
> __m128d __P_SUM = _mm_set_pd1(0.0);          // __P_SUM[0 ... VLEN] = 0
> __m128d __ONE   = _mm_set_pd1(1.);           // __ONE[0 ... VLEN] = 1
> __m128d __DEC   = _mm_set_pd1((double)VLEN);
> __m128d __L     = _mm_load_pd(l);
> __m128d __D_POW;                             // running power of __L
>
> for(k=start_index;k>end_index;k-=unroll)
>    {
>      __D_POW = __L;
>
>      for (m=n;m>1;m--)
>         {
>           __D_POW = _mm_mul_pd(__D_POW, __L);
>         }
>
>      __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
>
>      __L = _mm_sub_pd(__L, __DEC);
>
>    }
>
> _mm_store_pd(p_sum,__P_SUM);
>
> for(k=0;k<VLEN;k++)
>    {
>      sum += p_sum[k];
>    }
>
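
For anyone who wants to try the rewrite outside the full rzf program, here is a minimal self-contained sketch of the same idea. The function names, the `inf` and `n` parameters, and the choice of VLEN = 2 are assumptions made for the example, not taken from the original source. It builds the lane vector with `_mm_set_pd` and stores with `_mm_storeu_pd` so that no 16-byte-aligned arrays are needed (`_mm_load_pd` and `_mm_store_pd` require aligned addresses).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define VLEN 2          /* doubles per 128-bit SSE2 register */

/* Scalar reference: sum of 1/k^n for k = inf-1 down to 1. */
double rzf_scalar(long inf, int n)
{
    double sum = 0.0;
    for (long k = inf - 1; k >= 1; k--) {
        double l = (double)k, d_pow = l;
        for (int m = n; m > 1; m--)
            d_pow *= l;
        sum += 1.0 / d_pow;
    }
    return sum;
}

/* SSE2 version: two terms per iteration; needs inf-1 to be a multiple of VLEN. */
double rzf_sse2(long inf, int n)
{
    assert(inf > 1 && (inf - 1) % VLEN == 0);

    double p_sum[VLEN];
    __m128d P_SUM = _mm_set1_pd(0.0);
    __m128d ONE   = _mm_set1_pd(1.0);
    __m128d DEC   = _mm_set1_pd((double)VLEN);
    /* _mm_set_pd takes the high lane first, so the lanes are {inf-1, inf-2}. */
    __m128d L     = _mm_set_pd((double)(inf - 2), (double)(inf - 1));

    for (long k = inf - 1; k >= 1; k -= VLEN) {
        __m128d D_POW = L;
        for (int m = n; m > 1; m--)
            D_POW = _mm_mul_pd(D_POW, L);
        P_SUM = _mm_add_pd(P_SUM, _mm_div_pd(ONE, D_POW));
        L = _mm_sub_pd(L, DEC);
    }

    _mm_storeu_pd(p_sum, P_SUM);   /* unaligned store: p_sum is a plain array */
    return p_sum[0] + p_sum[1];
}
```

Calling both functions with the same arguments, for example rzf_scalar(1000001, 2) and rzf_sse2(1000001, 2), should give results that agree to within rounding; the two versions add the terms in a different order, so the last bits can differ.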
> The two codes were run on a Nehalem 3.2 GHz (desktop) processor,
> and a Shanghai 2.3 GHz desktop processor. Here are the results
>
> Code        CPU        Freq (GHz)   Wall clock (s)
> ---------   --------   ----------   --------------
>
> base        Nehalem    3.2          20.5
> optimized   Nehalem    3.2          6.72
> SSE-ized    Nehalem    3.2          3.37
>
> base        Shanghai   2.3          30.3
> optimized   Shanghai   2.3          7.36
> SSE-ized    Shanghai   2.3          3.68
>
> These are single-thread, single-core runs. The code scales very well
> (it is one of our example codes for the HPC/programming/parallelization
> classes we do).
>
> I found it interesting that the baseline code performance roughly
> tracks the ratio of clock speeds. The Nehalem has a 39% faster
> clock and showed 48% faster performance, which is about 9
> percentage points more than clock speed alone would account for.
> For the SSE code, the wall-clock times differ by about 9%.
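
The percentages can be checked directly from the table. A minimal sketch, using only the clock speeds and wall-clock times quoted above; the function names are hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Clock speeds from the table, in GHz. */
#define NEHALEM_GHZ  3.2
#define SHANGHAI_GHZ 2.3

/* How many times faster the Nehalem run is: Shanghai time / Nehalem time. */
double speedup(double t_shanghai, double t_nehalem)
{
    assert(t_nehalem > 0.0);
    return t_shanghai / t_nehalem;
}

/* Speedup divided by the clock ratio; above 1 means more work per cycle on Nehalem. */
double per_clock(double t_shanghai, double t_nehalem)
{
    return speedup(t_shanghai, t_nehalem) / (NEHALEM_GHZ / SHANGHAI_GHZ);
}

/* Print one row of the comparison. */
void report(const char *code, double t_shanghai, double t_nehalem)
{
    printf("%-10s speedup %.3f  per-clock %.3f\n", code,
           speedup(t_shanghai, t_nehalem), per_clock(t_shanghai, t_nehalem));
}
```

Calling report("base", 30.3, 20.5), report("optimized", 7.36, 6.72) and report("SSE-ized", 3.68, 3.37) gives a per-clock figure above 1 for the base code and below 1 for the other two. In other words, for the optimized and SSE-ized versions the Nehalem's wall-clock advantage is smaller than its clock advantage in this one test.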
>
> I am sure lots of interesting points can be made out of this (being
> only one test, and not the most typical test/use case either, such
> points may be of dubious value).
>
> I am working on a Cuda version of the above as well, and will try
> to compare this to the threaded versions of the above. I am
> curious what we can achieve.
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax : +1 866 888 3112
> cell : +1 734 612 4615