[Beowulf] Nehalem and Shanghai code performance for our rzf example
Kevin Abbey
kabbey at biomaps.rutgers.edu
Sat Jan 17 10:14:12 PST 2009
Hi Joe,
Can that 9% difference be due to the Intel capability (Turbo Boost) to
overclock one core while turning the others off?
Or does this Intel feature require a manual switch somewhere?
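One crude way to check on Linux (a sketch of my own, not anything from
the rzf code, and only as accurate as the kernel's own reporting) is to
sample the per-core clocks from /proc/cpuinfo while the benchmark runs:

#include <stdio.h>
#include <string.h>

/* Print the kernel-reported clock of every logical CPU; run this while
   the benchmark is active to see whether one core is clocked above its
   nominal frequency.  Linux-specific. */
int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];

    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "cpu MHz", 7) == 0)
            fputs(line, stdout); /* one "cpu MHz : ..." line per CPU */
    fclose(f);
    return 0;
}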
Thank you,
Kevin
Joe Landman wrote:
> Hi folks:
>
> Thought you might like to see this.  I rewrote the interior loop of
> our Riemann Zeta Function (rzf) example for SSE2, and ran it on a
> Nehalem and on a Shanghai.  This code is compute intensive: it
> accumulates the partial sum of zeta(n) = sum_k 1/k^n term by term,
> walking k down from a large starting value.  The inner loop, which
> had been written as follows (with some small hand optimizations:
> loop unrolling, etc.):
>
>   /* four independent lanes, starting at the largest terms */
>   l[0] = (double)(inf-1 - 0);
>   l[1] = (double)(inf-1 - 1);
>   l[2] = (double)(inf-1 - 2);
>   l[3] = (double)(inf-1 - 3);
>   p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
>
>   for (k = start_index; k > end_index; k -= unroll)
>   {
>     d_pow[0] = l[0];
>     d_pow[1] = l[1];
>     d_pow[2] = l[2];
>     d_pow[3] = l[3];
>
>     /* d_pow[i] = l[i]^n */
>     for (m = n; m > 1; m--)
>     {
>       d_pow[0] *= l[0];
>       d_pow[1] *= l[1];
>       d_pow[2] *= l[2];
>       d_pow[3] *= l[3];
>     }
>
>     /* accumulate 1/l[i]^n into the per-lane partial sums */
>     p_sum[0] += one/d_pow[0];
>     p_sum[1] += one/d_pow[1];
>     p_sum[2] += one/d_pow[2];
>     p_sum[3] += one/d_pow[3];
>
>     l[0] -= four;
>     l[1] -= four;
>     l[2] -= four;
>     l[3] -= four;
>   }
>   sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3];
>
> has been rewritten as
>
>   __m128d __P_SUM = _mm_set_pd1(0.0);          // __P_SUM[0 .. VLEN-1] = 0
>   __m128d __ONE   = _mm_set_pd1(1.0);          // __ONE[0 .. VLEN-1]   = 1
>   __m128d __DEC   = _mm_set_pd1((double)VLEN); // VLEN = 2 doubles per __m128d
>   __m128d __L     = _mm_load_pd(l);            // l must be 16-byte aligned
>   __m128d __D_POW;                             // declared here; implied in the original
>
>   for (k = start_index; k > end_index; k -= unroll)
>   {
>     __D_POW = __L;
>
>     /* __D_POW = __L^n, both lanes at once */
>     for (m = n; m > 1; m--)
>     {
>       __D_POW = _mm_mul_pd(__D_POW, __L);
>     }
>
>     /* partial sums += 1/__L^n */
>     __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
>
>     __L = _mm_sub_pd(__L, __DEC);
>   }
>
>   _mm_store_pd(p_sum, __P_SUM);
>
>   /* reduce the vector lanes to a scalar */
>   for (k = 0; k < VLEN; k++)
>   {
>     sum += p_sum[k];
>   }
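>
> For anyone who wants to try this at home, here is a minimal,
> self-contained harness around the SSE2 kernel.  To be clear, the
> harness itself (main(), the parameter choices, and the unaligned
> load/store variants) is my sketch, not the actual rzf source.  With
> gcc, something like "gcc -O3 -msse2" should build it.
>
> #include <emmintrin.h>  /* SSE2 intrinsics */
> #include <stdio.h>
>
> #define VLEN 2          /* an __m128d holds two doubles */
>
> int main(void)
> {
>     const long inf = 100000001; /* hypothetical summation limit */
>     const int  n   = 2;         /* zeta(2) = pi^2/6 = 1.6449... */
>     double l[VLEN] = { (double)(inf - 1), (double)(inf - 2) };
>     double p_sum[VLEN];
>     double sum = 0.0;
>     long k;
>     int  m;
>
>     __m128d __P_SUM = _mm_set_pd1(0.0);
>     __m128d __ONE   = _mm_set_pd1(1.0);
>     __m128d __DEC   = _mm_set_pd1((double)VLEN);
>     __m128d __L     = _mm_loadu_pd(l); /* unaligned load keeps the sketch simple */
>     __m128d __D_POW;
>
>     /* sum 1/j^n for j = inf-1 down to 1, two lanes at a time */
>     for (k = inf - 1; k > 0; k -= VLEN)
>     {
>         __D_POW = __L;
>         for (m = n; m > 1; m--)
>             __D_POW = _mm_mul_pd(__D_POW, __L);
>         __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
>         __L = _mm_sub_pd(__L, __DEC);
>     }
>
>     _mm_storeu_pd(p_sum, __P_SUM);
>     for (k = 0; k < VLEN; k++)
>         sum += p_sum[k];
>
>     printf("zeta(%d) ~= %.12f\n", n, sum);
>     return 0;
> }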
>
> The two codes were run on a 3.2 GHz Nehalem desktop processor and on
> a 2.3 GHz Shanghai desktop processor.  Here are the results:
>
> Code        CPU        Freq (GHz)   Wall clock (s)
> ---------   --------   ----------   --------------
> base        Nehalem    3.2          20.5
> optimized   Nehalem    3.2           6.72
> SSE-ized    Nehalem    3.2           3.37
>
> base        Shanghai   2.3          30.3
> optimized   Shanghai   2.3           7.36
> SSE-ized    Shanghai   2.3           3.68
>
> These are single-thread, single-core runs.  The code scales very well
> (it is one of our example codes for the HPC/programming/parallelization
> classes we do).
>
> I found it interesting that the baseline code performance started out
> roughly tracking the ratio of clock speeds: the Nehalem has a 39%
> faster clock (3.2/2.3 = 1.39) and showed 48% faster baseline
> performance (30.3/20.5 = 1.48), about 9 points more than clock speed
> alone would account for.  On the SSE code, by contrast, the two chips
> differ by only about 9% in total (3.68/3.37 = 1.09), well below the
> clock-speed ratio.
>
> I am sure lots of interesting points can be made out of this (though,
> as this is only one test, and not the most typical test/use case
> either, such points may be of dubious value).
>
> I am working on a CUDA version of the above as well, and will try to
> compare it to the threaded versions.  I am curious what we can
> achieve.
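>
> In the meantime, as a hypothetical illustration of what a threaded
> version might look like, the scalar kernel reduces to a simple OpenMP
> reduction.  This sketch is mine, not the actual rzf code; rzf_sum()
> and its arguments are made up for the example.  Build with the
> compiler's OpenMP flag (e.g. -fopenmp for gcc).
>
> /* Hypothetical OpenMP version of the scalar kernel: each thread
>    accumulates a private partial sum, and the reduction clause
>    combines them at the end. */
> double rzf_sum(long inf, int n)
> {
>     double sum = 0.0;
>     long j;
>
>     #pragma omp parallel for reduction(+:sum) schedule(static)
>     for (j = 1; j < inf; j++)
>     {
>         double p = (double)j;
>         int m;
>         for (m = n; m > 1; m--)
>             p *= (double)j;   /* p = j^n */
>         sum += 1.0 / p;       /* accumulate 1/j^n */
>     }
>     return sum;
> }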
>
> Joe
>
--
Kevin C. Abbey
System Administrator
Rutgers University - BioMaPS Institute
Email: kabbey at biomaps.rutgers.edu
Hill Center - Room 259
110 Frelinghuysen Road
Piscataway, NJ 08854
Phone and Voice mail: 732-445-3288
Wright-Rieman Laboratories Room 201
610 Taylor Rd.
Piscataway, NJ 08854-8087
Phone: 732-445-2069
Fax: 732-445-5958