[Beowulf] Nehalem and Shanghai code performance for our rzf example
Joe Landman
landman at scalableinformatics.com
Fri Jan 16 06:25:49 PST 2009
Hi folks:
Thought you might like to see this. I rewrote the interior loop for
our Riemann Zeta Function (rzf) example for SSE2, and ran it on a
Nehalem and on a Shanghai. This code is compute intensive. The inner
loop, which had been written like this (with some small hand optimization,
loop unrolling, etc.):
    l[0]=(double)(inf-1 - 0);
    l[1]=(double)(inf-1 - 1);
    l[2]=(double)(inf-1 - 2);
    l[3]=(double)(inf-1 - 3);
    p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
    for(k=start_index;k>end_index;k-=unroll)
    {
        d_pow[0] = l[0];
        d_pow[1] = l[1];
        d_pow[2] = l[2];
        d_pow[3] = l[3];
        for (m=n;m>1;m--)
        {
            d_pow[0] *= l[0];
            d_pow[1] *= l[1];
            d_pow[2] *= l[2];
            d_pow[3] *= l[3];
        }
        p_sum[0] += one/d_pow[0];
        p_sum[1] += one/d_pow[1];
        p_sum[2] += one/d_pow[2];
        p_sum[3] += one/d_pow[3];
        l[0]-=four;
        l[1]-=four;
        l[2]-=four;
        l[3]-=four;
    }
    sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;
has been rewritten as:
    __m128d __P_SUM = _mm_set_pd1(0.0);   // __P_SUM[0 ... VLEN] = 0
    __m128d __ONE   = _mm_set_pd1(1.);    // __ONE[0 ... VLEN] = 1
    __m128d __DEC   = _mm_set_pd1((double)VLEN);
    __m128d __L     = _mm_load_pd(l);
    for(k=start_index;k>end_index;k-=unroll)
    {
        __D_POW = __L;
        for (m=n;m>1;m--)
        {
            __D_POW = _mm_mul_pd(__D_POW, __L);
        }
        __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
        __L = _mm_sub_pd(__L, __DEC);
    }
    _mm_store_pd(p_sum,__P_SUM);
    for(k=0;k<VLEN;k++)
    {
        sum += p_sum[k];
    }
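For anyone who wants to try the two loops side by side, here is a minimal, self-contained sketch that should compile on its own. It is not the actual rzf source: the function names rzf_scalar and rzf_sse2, the two-lane layout, and the requirement that the term count be even are all assumptions made for illustration. It builds the starting vector with _mm_set_pd and stores with _mm_storeu_pd, so it does not depend on 16-byte aligned arrays the way _mm_load_pd and _mm_store_pd do, and it falls back to the scalar loop when SSE2 is not available.

```c
#include <assert.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference (hypothetical helper): sum of 1/k^n for
 * k = terms, terms-1, ..., 1, smallest terms added first. */
double rzf_scalar(long terms, int n)
{
    double sum = 0.0;
    for (long k = terms; k >= 1; k--) {
        double base = (double)k;
        double d_pow = base;
        for (int m = n; m > 1; m--)
            d_pow *= base;          /* d_pow = k^n after the loop */
        sum += 1.0 / d_pow;
    }
    return sum;
}

/* Two-lane SSE2 version (hypothetical helper). Assumes terms is even,
 * so each pass handles the pair (k, k-1) and the last pass is (2, 1). */
double rzf_sse2(long terms, int n)
{
    assert(terms > 0 && terms % 2 == 0);
#if defined(__SSE2__)
    __m128d p_sum = _mm_setzero_pd();
    const __m128d one = _mm_set1_pd(1.0);
    const __m128d dec = _mm_set1_pd(2.0);   /* two lanes per pass */
    /* _mm_set_pd takes the high lane first: lanes are (terms, terms-1) */
    __m128d l = _mm_set_pd((double)(terms - 1), (double)terms);
    for (long k = terms; k > 0; k -= 2) {
        __m128d d_pow = l;
        for (int m = n; m > 1; m--)
            d_pow = _mm_mul_pd(d_pow, l);
        p_sum = _mm_add_pd(p_sum, _mm_div_pd(one, d_pow));
        l = _mm_sub_pd(l, dec);
    }
    double lanes[2];
    _mm_storeu_pd(lanes, p_sum);            /* unaligned store is fine here */
    return lanes[0] + lanes[1];
#else
    return rzf_scalar(terms, n);            /* no SSE2: scalar fallback */
#endif
}
```

Calling rzf_scalar(1000, 2) and rzf_sse2(1000, 2) should give values that agree to within rounding, since the SSE2 path only changes the order in which the partial sums are added.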
The two codes were run on a 3.2 GHz Nehalem (desktop) processor and a
2.3 GHz Shanghai desktop processor. Here are the results:
Code        CPU        Freq (GHz)   Wall clock (s)
---------   --------   ----------   --------------
base        Nehalem    3.2          20.5
optimized   Nehalem    3.2          6.72
SSE-ized    Nehalem    3.2          3.37
base        Shanghai   2.3          30.3
optimized   Shanghai   2.3          7.36
SSE-ized    Shanghai   2.3          3.68
These are single thread, single core runs. Code scales very well (is
one of our example codes for the HPC/programming/parallelization classes
we do).
I found it interesting that the baseline code performance roughly tracks
the ratio of clock speeds ... The Nehalem has a 39% faster clock, and
showed 48% faster performance, which is about 9 percentage points more
than could be accounted for by clock speed alone. With the SSE code, the
Nehalem is only about 9% faster (3.37 s vs. 3.68 s).
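The percentages can be rechecked from the wall clock times in the table with a couple of small helpers. This is only a sketch: the names percent_faster and percent_beyond_clock are made up for illustration, and "faster" is taken to mean the ratio of wall clock times minus one.

```c
/* Percent by which the run taking fast_s seconds beats the run
 * taking slow_s seconds (hypothetical helper). */
double percent_faster(double slow_s, double fast_s)
{
    return (slow_s / fast_s - 1.0) * 100.0;
}

/* Percent speedup left over after dividing out the clock ratio
 * (hypothetical helper). Negative means the faster-clocked part
 * gained less than its clock advantage. */
double percent_beyond_clock(double slow_s, double fast_s,
                            double slow_ghz, double fast_ghz)
{
    double time_ratio  = slow_s / fast_s;
    double clock_ratio = fast_ghz / slow_ghz;
    return (time_ratio / clock_ratio - 1.0) * 100.0;
}
```

With the table values, percent_faster(30.3, 20.5) comes out near 48 for the base code and percent_faster(3.68, 3.37) near 9 for the SSE code. percent_beyond_clock(30.3, 20.5, 2.3, 3.2) is roughly 6, which is the multiplicative way of stating the 48% versus 39% gap.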
I am sure lots of interesting points can be made out of this (being only
one test, and not the most typical test/use case either, such points may
be of dubious value).
I am working on a Cuda version of the above as well, and will try to
compare this to the threaded versions of the above. I am curious what
we can achieve.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615