# [Beowulf] Nehalem and Shanghai code performance for our rzf example

Joe Landman landman at scalableinformatics.com
Fri Jan 16 06:25:49 PST 2009

Hi folks:

Thought you might like to see this.  I rewrote the interior loop for
our Riemann Zeta Function (rzf) example using SSE2 intrinsics, and ran it
on a Nehalem and on a Shanghai.  This code is compute-intensive.  The
inner loop, which had been written like this (with some small hand
optimization, loop unrolling, etc.):

    l[0]=(double)(inf-1 - 0);
    l[1]=(double)(inf-1 - 1);
    l[2]=(double)(inf-1 - 2);
    l[3]=(double)(inf-1 - 3);
    p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
    for(k=start_index;k>end_index;k-=unroll)
    {
        d_pow[0] = l[0];
        d_pow[1] = l[1];
        d_pow[2] = l[2];
        d_pow[3] = l[3];

        for (m=n;m>1;m--)
        {
            d_pow[0] *=  l[0];
            d_pow[1] *=  l[1];
            d_pow[2] *=  l[2];
            d_pow[3] *=  l[3];
        }
        p_sum[0] += one/d_pow[0];
        p_sum[1] += one/d_pow[1];
        p_sum[2] += one/d_pow[2];
        p_sum[3] += one/d_pow[3];

        l[0]-=four;
        l[1]-=four;
        l[2]-=four;
        l[3]-=four;
    }
    sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

has been rewritten as:

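    __m128d __P_SUM = _mm_set_pd1(0.0);        // __P_SUM[0 ... VLEN-1] = 0
    __m128d __ONE   = _mm_set_pd1(1.);         // __ONE[0 ... VLEN-1] = 1
    __m128d __DEC   = _mm_set_pd1((double)VLEN);
    __m128d __L     = _mm_loadu_pd(l);         // assumed: __L starts from l[0 ... VLEN-1]
    __m128d __D_POW;                           // assumed declaration

    for(k=start_index;k>end_index;k-=unroll)
    {
        __D_POW = __L;

        for (m=n;m>1;m--)
        {
            __D_POW = _mm_mul_pd(__D_POW, __L);
        }

        // assumed: accumulate 1/l^n per lane, as p_sum[i] += one/d_pow[i]
        // does in the scalar loop
        __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

        __L = _mm_sub_pd(__L, __DEC);
    }

    _mm_store_pd(p_sum,__P_SUM);

    for(k=0;k<VLEN;k++)
    {
        sum += p_sum[k];
    }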
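
For anyone who wants to try the comparison, here is a minimal self-contained sketch of the same idea, assuming an x86 compiler with SSE2 and `<emmintrin.h>`. The function names `rzf_scalar` and `rzf_sse2`, the use of `inf` as an exclusive upper bound on the summation index, and the requirement that `inf - 1` be a multiple of `VLEN` are all assumptions made for the sketch; this is not the code from the rzf example itself.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define VLEN 2           /* doubles per __m128d register */

/* Scalar reference: sum of 1/k^n for k = inf-1 down to 1. */
double rzf_scalar(int n, long inf)
{
    double sum = 0.0;
    for (long k = inf - 1; k >= 1; k--) {
        double l = (double)k, d_pow = l;
        for (int m = n; m > 1; m--)
            d_pow *= l;                 /* d_pow = k^n */
        sum += 1.0 / d_pow;
    }
    return sum;
}

/* SSE2 version: two terms per iteration, one per lane.
   Assumes (inf - 1) is a multiple of VLEN so no remainder loop is needed. */
double rzf_sse2(int n, long inf)
{
    assert((inf - 1) % VLEN == 0);

    double  p_sum[VLEN];
    __m128d P_SUM = _mm_set1_pd(0.0);
    __m128d ONE   = _mm_set1_pd(1.0);
    __m128d DEC   = _mm_set1_pd((double)VLEN);
    /* low lane = inf-1, high lane = inf-2 */
    __m128d L     = _mm_set_pd((double)(inf - 2), (double)(inf - 1));

    for (long k = inf - 1; k >= 1; k -= VLEN) {
        __m128d D_POW = L;
        for (int m = n; m > 1; m--)
            D_POW = _mm_mul_pd(D_POW, L);                  /* lane-wise k^n */
        P_SUM = _mm_add_pd(P_SUM, _mm_div_pd(ONE, D_POW)); /* += 1/k^n */
        L = _mm_sub_pd(L, DEC);                            /* step both lanes down by VLEN */
    }

    _mm_storeu_pd(p_sum, P_SUM);        /* unaligned store: p_sum has no alignment guarantee */
    return p_sum[0] + p_sum[1];
}
```

With `n = 2` and `inf = 1000001`, the two functions should agree to within rounding, and both approach pi^2/6 (about 1.644934) as `inf` grows. The two results need not be bit-identical, since the SSE2 version adds the terms in a different order.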

The two codes, along with the unoptimized baseline, were run on a
Nehalem 3.2 GHz (desktop) processor and a Shanghai 2.3 GHz desktop
processor.  Here are the results:

Code        CPU        Freq (GHz)   Wall clock (s)
---------   --------   ----------   --------------

base        Nehalem    3.2          20.5
optimized   Nehalem    3.2           6.72
SSE-ized    Nehalem    3.2           3.37

base        Shanghai   2.3          30.3
optimized   Shanghai   2.3           7.36
SSE-ized    Shanghai   2.3           3.68

These are single-thread, single-core runs.  The code scales very well
(it is one of our example codes for the HPC/programming/parallelization
classes we do).
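
To put the table in terms of speedup over the base code, a small helper like the following can be used. This is only a sketch: the function name `speedups_vs_first` and the array names are made up for illustration, and the numbers are simply the wall-clock times copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Fill out[i] with wall_s[0] / wall_s[i], i.e. the speedup of run i
   over the first (baseline) run in the array. */
void speedups_vs_first(const double *wall_s, size_t n, double *out)
{
    for (size_t i = 0; i < n; i++) {
        assert(wall_s[i] > 0.0);
        out[i] = wall_s[0] / wall_s[i];
    }
}

/* Wall-clock seconds from the table, in the order: base, optimized, SSE-ized. */
const double nehalem_s[3]  = {20.5, 6.72, 3.37};
const double shanghai_s[3] = {30.3, 7.36, 3.68};
```

Applied to the two arrays, this gives roughly 3.05x (optimized) and 6.08x (SSE-ized) over base on the Nehalem, and roughly 4.12x and 8.23x on the Shanghai. On both chips the SSE-ized code is almost exactly twice as fast as the optimized code, which is what one would hope for when moving from one double to two doubles per operation.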

I found it interesting that the baseline code performance roughly tracks
the ratio of clock speeds.  The Nehalem has a 39% faster clock (3.2 vs
2.3 GHz) and showed 48% faster performance (20.5 s vs 30.3 s), which is
about 9 percentage points more than could be accounted for by clock
speed alone.  For the SSE code the two differ by only about 9% (3.37 s
vs 3.68 s), well under the clock ratio.
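
The percentages above are simple ratios, and the same arithmetic can be pushed one step further by normalizing wall time by clock speed. The sketch below assumes each core ran at its listed frequency for the whole run (no frequency scaling or turbo), which may not hold; `pct_more` and `gigacycles` are hypothetical helper names.

```c
#include <assert.h>

/* Percent by which a exceeds b: 100 * (a/b - 1). */
double pct_more(double a, double b)
{
    assert(b > 0.0);
    return 100.0 * (a / b - 1.0);
}

/* Wall time (s) times nominal clock (GHz) = nominal cycles, in units of 1e9.
   Assumes the core ran at exactly `ghz` for the whole run. */
double gigacycles(double wall_s, double ghz)
{
    return wall_s * ghz;
}
```

For example, `pct_more(3.2, 2.3)` is about 39.1 (clock), `pct_more(30.3, 20.5)` is about 47.8 (base code), and `pct_more(3.68, 3.37)` is about 9.2 (SSE code).  Under the fixed-clock assumption, `gigacycles(3.37, 3.2)` is about 10.8 for the Nehalem and `gigacycles(3.68, 2.3)` is about 8.5 for the Shanghai on the SSE code; if either chip actually ran above or below its listed frequency, that comparison changes.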

I am sure lots of interesting points can be made out of this (being only
one test, and not the most typical test/use case either, such points may
be of dubious value).

I am working on a CUDA version of the above as well, and will try to
compare this to the threaded versions of the above.  I am curious what
we can achieve.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615