[Beowulf] Nehalem and Shanghai code performance for our rzf example

Fri Jan 16 09:39:32 PST 2009

Note that single-threaded performance doesn't tell you much here,
because when just 1 core is busy, Nehalem automatically overclocks that core.

A very nasty feature.

My experience is that Shanghai scales nearly 4.0x going from 1 core to all
cores, versus about 3.2x for Nehalem, because Nehalem's 1-core baseline is
already overclocked.

So seeing a 9% higher IPC is not very weird.

Thanks,
Vincent

On Jan 16, 2009, at 3:25 PM, Joe Landman wrote:

> Hi folks:
>
>   Thought you might like to see this.  I rewrote the interior loop  
> for our Riemann Zeta Function (rzf) example for SSE2, and ran it on  
> a Nehalem and on a Shanghai.  This code is compute intensive.  The  
> inner loop which had been written as this (some small hand  
> optimization, loop unrolling, etc):
>
>     l[0]=(double)(inf-1 - 0);
>     l[1]=(double)(inf-1 - 1);
>     l[2]=(double)(inf-1 - 2);
>     l[3]=(double)(inf-1 - 3);
>     p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
>     for(k=start_index;k>end_index;k-=unroll)
>        {
>           d_pow[0] = l[0];
>           d_pow[1] = l[1];
>           d_pow[2] = l[2];
>           d_pow[3] = l[3];
>
>           for (m=n;m>1;m--)
>            {
>              d_pow[0] *=  l[0];
>              d_pow[1] *=  l[1];
>              d_pow[2] *=  l[2];
>              d_pow[3] *=  l[3];
>            }
>           p_sum[0] += one/d_pow[0];
>           p_sum[1] += one/d_pow[1];
>           p_sum[2] += one/d_pow[2];
>           p_sum[3] += one/d_pow[3];
>
>           l[0]-=four;
>           l[1]-=four;
>           l[2]-=four;
>           l[3]-=four;
>       }
>     sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;
>
> has been rewritten as
>
>     __m128d __P_SUM = _mm_set_pd1(0.0);        // __P_SUM[0 ... VLEN] = 0
>     __m128d __ONE = _mm_set_pd1(1.);   // __ONE[0 ... VLEN] = 1
>     __m128d __DEC = _mm_set_pd1((double)VLEN);
>     __m128d __L   = _mm_load_pd(l);
>
>     for(k=start_index;k>end_index;k-=unroll)
>        {
>           __D_POW       = __L;
>
>           for (m=n;m>1;m--)
>            {
>              __D_POW    = _mm_mul_pd(__D_POW, __L);
>            }
>
>           __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
>
>           __L           = _mm_sub_pd(__L, __DEC);
>
>       }
>
>     _mm_store_pd(p_sum,__P_SUM);
>
>     for(k=0;k<VLEN;k++)
>      {
>        sum += p_sum[k];
>      }
>
> The two codes were run on a Nehalem 3.2 GHz (desktop) processor,  
> and a Shanghai 2.3 GHz desktop processor.  Here are the results
>
>     Code        CPU        Freq (GHz)   Wall clock (s)
>     ---------   --------   ----------   --------------
>     base        Nehalem    3.2          20.5
>     optimized   Nehalem    3.2          6.72
>     SSE-ized    Nehalem    3.2          3.37
>
>     base        Shanghai   2.3          30.3
>     optimized   Shanghai   2.3          7.36
>     SSE-ized    Shanghai   2.3          3.68
> 	
> These are single thread, single core runs.  Code scales very well  
> (is one of our example codes for the HPC/programming/ 
> parallelization classes we do).
>
> I found it interesting that they started out with the baseline code  
> performance tracking the ratio of clock speeds ... The Nehalem has  
> a 39% faster clock, and showed 48% faster performance, which is  
> about 9 percentage points more than could be accounted for by clock  
> speed alone.  With the SSE code, the wall clock times differ by only  
> about 9%.
>
> I am sure lots of interesting points can be made out of this (being  
> only one test, and not the most typical test/use case either, such  
> points may be of dubious value).
>
> I am working on a Cuda version of the above as well, and will try  
> to compare this to the threaded versions of the above.  I am  
> curious what we can achieve.
>
> Joe
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf