[Beowulf] Nehalem and Shanghai code performance for our rzf example
    Joe Landman 
    landman at scalableinformatics.com
       
    Fri Jan 16 06:25:49 PST 2009
    
    
  
Hi folks:
   Thought you might like to see this.  I rewrote the interior loop of 
our Riemann Zeta Function (rzf) example for SSE2, and ran it on a 
Nehalem and on a Shanghai.  This code is compute intensive.  The inner 
loop, which had been written like this (with some small hand 
optimizations, loop unrolling, etc.):
     /* seed the four unrolled lanes with the largest k values */
     l[0]=(double)(inf-1 - 0);
     l[1]=(double)(inf-1 - 1);
     l[2]=(double)(inf-1 - 2);
     l[3]=(double)(inf-1 - 3);
     p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
     for(k=start_index;k>end_index;k-=unroll)
        {
           /* compute l[i]^n in each lane by repeated multiplication */
           d_pow[0] = l[0];
           d_pow[1] = l[1];
           d_pow[2] = l[2];
           d_pow[3] = l[3];
           for (m=n;m>1;m--)
            {
              d_pow[0] *=  l[0];
              d_pow[1] *=  l[1];
              d_pow[2] *=  l[2];
              d_pow[3] *=  l[3];
            }
           /* accumulate 1/l[i]^n, then step each lane down by 4 */
           p_sum[0] += one/d_pow[0];
           p_sum[1] += one/d_pow[1];
           p_sum[2] += one/d_pow[2];
           p_sum[3] += one/d_pow[3];
           l[0]-=four;
           l[1]-=four;
           l[2]-=four;
           l[3]-=four;
       }
     sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;
has been rewritten as:
     __m128d __P_SUM = _mm_set_pd1(0.0);        // __P_SUM[0 .. VLEN-1] = 0
     __m128d __ONE   = _mm_set_pd1(1.);         // __ONE[0 .. VLEN-1]   = 1
     __m128d __DEC   = _mm_set_pd1((double)VLEN);
     __m128d __L     = _mm_load_pd(l);          // l[] must be 16-byte aligned
     __m128d __D_POW;
     for(k=start_index;k>end_index;k-=unroll)
        {
           __D_POW       = __L;
           for (m=n;m>1;m--)
            {
              __D_POW    = _mm_mul_pd(__D_POW, __L);   // __D_POW = __L^n
            }
           __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
           __L           = _mm_sub_pd(__L, __DEC);
       }
     _mm_store_pd(p_sum,__P_SUM);               // spill the lanes and reduce
     for(k=0;k<VLEN;k++)
      {
        sum += p_sum[k];
      }
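
Both loops compute the same thing: a block of terms 1/k^n of the zeta 
series, zeta(n) = sum_{k=1..inf} 1/k^n, walking k from the large end 
down.  In case it is useful, here is a minimal self-contained sketch 
(my packaging here, not the actual rzf source) of how the SSE2 
fragment can be wrapped as a compilable function, assuming VLEN = 2 
doubles per register and 16-byte aligned scratch arrays:

#include <emmintrin.h>

#define VLEN 2                        /* two doubles per SSE2 register */

/* sum 1/k^n for k = start down to end+1; assumes (start - end) is even */
static double zeta_block_sse2(long start, long end, int n)
{
     double l[VLEN]     __attribute__((aligned(16))) =
                          { (double)start, (double)(start - 1) };
     double p_sum[VLEN] __attribute__((aligned(16)));
     double sum = 0.0;
     long   k;
     int    m;

     __m128d __P_SUM = _mm_set_pd1(0.0);
     __m128d __ONE   = _mm_set_pd1(1.0);
     __m128d __DEC   = _mm_set_pd1((double)VLEN);
     __m128d __L     = _mm_load_pd(l);
     __m128d __D_POW;

     for (k = start; k > end; k -= VLEN)
        {
          __D_POW = __L;
          for (m = n; m > 1; m--)               /* __D_POW = __L^n */
             __D_POW = _mm_mul_pd(__D_POW, __L);
          __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
          __L     = _mm_sub_pd(__L, __DEC);
        }

     _mm_store_pd(p_sum, __P_SUM);              /* reduce the two lanes */
     for (m = 0; m < VLEN; m++)
        sum += p_sum[m];
     return sum;
}

The intrinsics come from emmintrin.h (the SSE2 header); a gcc build 
along the lines of "gcc -O3 -msse2 -c rzf_sse2.c" should work (the 
file name is just a placeholder).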
The code variants were run on a 3.2 GHz Nehalem desktop processor and 
a 2.3 GHz Shanghai desktop processor.  Here are the results:
	Code       CPU       Freq (GHz)   Wall clock (s)
	---------  --------  -----------  --------------
	base       Nehalem   3.2          20.5
	optimized  Nehalem   3.2           6.72
	SSE-ized   Nehalem   3.2           3.37
	base       Shanghai  2.3          30.3
	optimized  Shanghai  2.3           7.36
	SSE-ized   Shanghai  2.3           3.68
	
These are single thread, single core runs.  The code scales very well 
(it is one of our example codes for the HPC/programming/parallelization 
classes we teach).
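
The reason it scales so well is simple: each k term is independent, so 
a threaded version is just a reduction over the loop.  A minimal OpenMP 
sketch of the idea (not our actual class code, and without the 
unrolling above, so purely illustrative) looks like this:

#include <omp.h>

/* sum 1/k^n for k = start down to end+1, threaded with OpenMP */
double zeta_block_omp(long start, long end, int n)
{
     double sum = 0.0;
     long   k;

     #pragma omp parallel for reduction(+:sum)
     for (k = start; k > end; k--)
        {
          double d_pow = (double)k;
          int    m;
          for (m = n; m > 1; m--)
             d_pow *= (double)k;                /* d_pow = k^n */
          sum += 1.0 / d_pow;
        }
     return sum;
}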
I found it interesting that the baseline code performance roughly 
tracks the ratio of clock speeds: the Nehalem has a 39% faster clock 
and showed 48% faster performance, which is about 9% more than clock 
speed alone would account for.  The SSE code performance, by contrast, 
differs by only about 9%.
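
Working the ratios out explicitly from the table above:

	clock      3.2  / 2.3  = 1.39    (Nehalem clock 39% higher)
	base       30.3 / 20.5 = 1.48    (Nehalem 48% faster)
	optimized  7.36 / 6.72 = 1.10    (Nehalem 10% faster)
	SSE-ized   3.68 / 3.37 = 1.09    (Nehalem  9% faster)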
I am sure lots of interesting points can be drawn from this (though, 
being only one test, and not the most typical test/use case either, 
such points may be of dubious value).
I am working on a CUDA version of the above as well, and will try to 
compare it to the threaded versions.  I am curious what we can achieve.
Joe
-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
    
    