[Beowulf] Nehalem and Shanghai code performance for our rzf example

Fri Jan 16 06:25:49 PST 2009

Hi folks:

   Thought you might like to see this.  I rewrote the inner loop of 
our Riemann Zeta Function (rzf) example to use SSE2, and ran it on a 
Nehalem and on a Shanghai.  This code is compute intensive.  The inner 
loop, which had been written as follows (with some small hand 
optimization, loop unrolling, etc.):

     l[0]=(double)(inf-1 - 0);
     l[1]=(double)(inf-1 - 1);
     l[2]=(double)(inf-1 - 2);
     l[3]=(double)(inf-1 - 3);
     p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
     for(k=start_index;k>end_index;k-=unroll)
        {
           d_pow[0] = l[0];
           d_pow[1] = l[1];
           d_pow[2] = l[2];
           d_pow[3] = l[3];

           for (m=n;m>1;m--)
            {
              d_pow[0] *=  l[0];
              d_pow[1] *=  l[1];
              d_pow[2] *=  l[2];
              d_pow[3] *=  l[3];
            }
           p_sum[0] += one/d_pow[0];
           p_sum[1] += one/d_pow[1];
           p_sum[2] += one/d_pow[2];
           p_sum[3] += one/d_pow[3];

           l[0]-=four;
           l[1]-=four;
           l[2]-=four;
           l[3]-=four;
       }
     sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

has been rewritten as

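     __m128d __P_SUM = _mm_set_pd1(0.0);        // all VLEN lanes of __P_SUM = 0
     __m128d __ONE = _mm_set_pd1(1.);   // all VLEN lanes of __ONE = 1
     __m128d __DEC = _mm_set_pd1((double)VLEN); // per-iteration decrement of __L
     __m128d __L   = _mm_load_pd(l);    // l must be 16-byte aligned
     __m128d __D_POW;                   // running power of __L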

     for(k=start_index;k>end_index;k-=unroll)
        {
           __D_POW       = __L;

           for (m=n;m>1;m--)
            {
              __D_POW    = _mm_mul_pd(__D_POW, __L);
            }

           __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

           __L           = _mm_sub_pd(__L, __DEC);

       }

     _mm_store_pd(p_sum,__P_SUM);

     for(k=0;k<VLEN;k++)
      {
        sum += p_sum[k];
      }
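
For anyone who wants to try the idea without the rest of our source, 
here is a minimal self-contained sketch of the same technique.  It is 
not the actual rzf code: the function names rzf_sse2 and rzf_scalar, 
the terms argument, and the simplified loop bounds are assumptions made 
for illustration.  It uses unaligned stores so no alignment attributes 
are needed, and it assumes an x86 compiler with SSE2 enabled.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define VLEN 2           /* doubles per __m128d register */

/* Hypothetical helper: sum of 1/k^n for k = 1..terms, two lanes at a time.
   Assumes n >= 1 and terms is a positive multiple of VLEN. */
double rzf_sse2(int n, long terms)
{
    assert(n >= 1 && terms > 0 && terms % VLEN == 0);

    /* Lanes start at the two largest k, so the small terms are added first. */
    __m128d v_l   = _mm_set_pd((double)(terms - 1), (double)terms);
    __m128d v_dec = _mm_set1_pd((double)VLEN);
    __m128d v_one = _mm_set1_pd(1.0);
    __m128d v_sum = _mm_setzero_pd();

    for (long k = terms; k > 0; k -= VLEN) {
        __m128d v_pow = v_l;
        for (int m = n; m > 1; m--)          /* v_pow = v_l^n */
            v_pow = _mm_mul_pd(v_pow, v_l);
        v_sum = _mm_add_pd(v_sum, _mm_div_pd(v_one, v_pow));
        v_l   = _mm_sub_pd(v_l, v_dec);
    }

    double p_sum[VLEN];
    _mm_storeu_pd(p_sum, v_sum);             /* unaligned store is fine here */
    return p_sum[0] + p_sum[1];
}

/* Scalar reference for the same sum, used to check the SSE2 version. */
double rzf_scalar(int n, long terms)
{
    double sum = 0.0;
    for (long k = terms; k > 0; k--) {
        double d_pow = (double)k;
        for (int m = n; m > 1; m--)
            d_pow *= (double)k;
        sum += 1.0 / d_pow;
    }
    return sum;
}
```

Calling rzf_sse2(2, 1000000) should land close to pi^2/6, and should 
agree with rzf_scalar to within rounding, since the two versions differ 
only in the order in which the terms are accumulated.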

The codes were run on a Nehalem 3.2 GHz (desktop) processor, and a 
Shanghai 2.3 GHz desktop processor.  Here are the results ("optimized" 
is the hand-unrolled loop above, "SSE-ized" is the SSE2 rewrite):

	Code        CPU        Freq (GHz)   Wall clock (s)
	---------   --------   ----------   --------------

	base        Nehalem    3.2          20.5
	optimized   Nehalem    3.2           6.72
	SSE-ized    Nehalem    3.2           3.37

	base        Shanghai   2.3          30.3
	optimized   Shanghai   2.3           7.36
	SSE-ized    Shanghai   2.3           3.68

These are single-thread, single-core runs.  The code scales very well 
(it is one of our example codes for the HPC/programming/parallelization 
classes we do).

I found it interesting that the baseline code performance roughly 
tracks the ratio of clock speeds.  The Nehalem has a 39% faster clock, 
and ran the baseline code 48% faster, which is about 9 percentage 
points more than could be accounted for by clock speed alone.  With the 
SSE code, the two processors are only about 9% apart, well under the 
39% clock difference.
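
The percentages above come straight from the table.  A small sketch of 
the arithmetic follows; the names pct_more and print_ratios are made up 
for this example, and the inputs are just the table values.

```c
#include <stdio.h>

/* Percentage by which a exceeds b, e.g. pct_more(3.2, 2.3) is about 39. */
double pct_more(double a, double b)
{
    return (a / b - 1.0) * 100.0;
}

/* Print the Nehalem-vs-Shanghai ratios using the table values. */
void print_ratios(void)
{
    printf("clock:     %.1f%%\n", pct_more(3.2, 2.3));    /* about 39.1 */
    printf("base:      %.1f%%\n", pct_more(30.3, 20.5));  /* about 47.8 */
    printf("optimized: %.1f%%\n", pct_more(7.36, 6.72));  /* about 9.5  */
    printf("SSE-ized:  %.1f%%\n", pct_more(3.68, 3.37));  /* about 9.2  */
}
```

For the wall clock rows the slower (Shanghai) time goes first, so each 
result reads as "Nehalem is this much faster".  Call print_ratios() 
from main to see all four numbers.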

I am sure lots of interesting points can be made out of this (being only 
one test, and not the most typical test/use case either, such points may 
be of dubious value).

I am working on a CUDA version of the above as well, and will try to 
compare it to the threaded versions of the above.  I am curious what 
we can achieve.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615