[Beowulf] bizarre scaling behavior on a Nehalem

Thu Aug 13 17:09:24 PDT 2009

Tom Elken wrote:
> To add some details to what Christian says, the HPC Challenge version of
> STREAM uses dynamic arrays and is hard to optimize.  I don't know what's
> best with current compiler versions, but you could try some of these that
> were used in past HPCC submissions with your program, Bill:

Thanks for the heads-up. I've checked the specbench.org compiler options for
hints on where to start with optimization flags, but I didn't know about the
dynamic-array version of STREAM.

Is the HPC Challenge code open source?

> PathScale 2.2.1 on Opteron:
> Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 
> STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1

Alas, my PathScale license expired, and I believe that with SiCortex's death
(RIP) I can't renew it.

I tried open64-4.2.2 with those flags on a single-socket Nehalem:

$ opencc -O4 -fopenmp stream.c -o stream-open64 -static
$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static

$ ./stream-open64
Total memory required = 457.8 MB.
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       22061.4958       0.0145       0.0145       0.0146
Scale:      22228.4705       0.0144       0.0144       0.0145
Add:        20659.2638       0.0233       0.0232       0.0233
Triad:      20511.0888       0.0235       0.0234       0.0235

Dynamic:
$ ./stream-open64-malloc

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       14436.5155       0.0222       0.0222       0.0222
Scale:      14667.4821       0.0218       0.0218       0.0219
Add:        15739.7070       0.0305       0.0305       0.0305
Triad:      15770.7775       0.0305       0.0304       0.0305
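
As a cross-check on the two tables above: STREAM counts two arrays of traffic
for Copy and Scale and three for Add and Triad, and reports
1.0E-06 * bytes / min time. The sketch below is only an illustration.
stream_rate_mbs is a hypothetical helper name, not something in stream.c, and
N = 20,000,000 is an assumption inferred from the "Total memory required =
457.8 MB" line (three arrays of 8-byte doubles), not stated in the post.

```c
#include <stddef.h>

/* Hypothetical helper, not part of stream.c.
 * Returns bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel that
 * reads or writes `arrays` arrays of `n` doubles in `seconds`.
 * Copy and Scale touch 2 arrays; Add and Triad touch 3. */
double stream_rate_mbs(size_t n, int arrays, double seconds)
{
    double bytes = (double)arrays * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

With n = 20000000, stream_rate_mbs(n, 2, 0.0145) gives roughly 22069 MB/s,
close to the reported static Copy rate of 22061.5; the small difference comes
from the min time being printed to four decimal places.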

> Intel C/C++ Compiler 10.1 on Harpertown CPUs:
> Base OPT flags:	 -O2 -xT -ansi-alias -ip -i-static
> Intel recently used
> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs:
> 	 -O2 -xSSE4.2 -ansi-alias -ip
> and got good STREAM results in their HPCC submission on their Endeavor cluster.

$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc
$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc

$ ./stream-icc | grep ":"
STREAM version $Revision: 5.9 $
Copy:       14767.0512       0.0022       0.0022       0.0022
Scale:      14304.3513       0.0022       0.0022       0.0023
Add:        15503.3568       0.0031       0.0031       0.0031
Triad:      15613.9749       0.0031       0.0031       0.0031
$ ./stream-icc-malloc | grep ":"
STREAM version $Revision: 5.9 $
Copy:       14604.7582       0.0022       0.0022       0.0022
Scale:      14480.2814       0.0022       0.0022       0.0022
Add:        15414.3321       0.0031       0.0031       0.0031
Triad:      15738.4765       0.0031       0.0030       0.0031

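So ICC does manage zero penalty for dynamic arrays; alas, it is no faster than
open64 with the penalty (roughly 14 to 16 GB/s in both cases, versus 20 to 22
GB/s for open64 with static arrays).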

I'll attempt to track down the HPCC STREAM source code to see if their dynamic
arrays are any friendlier than mine (I just use malloc).

In any case many thanks for the pointer.

Oh, my dynamic tweak:
$ diff stream.c stream-malloc.c
43a44
> # include <stdlib.h>
97c98
< static double	a[N+OFFSET],
---
> /* static double	a[N+OFFSET],
99c100,102
< 		c[N+OFFSET];
---
> 		c[N+OFFSET]; */
>
> double *a, *b, *c;
134a138,142
>
>     a=(double *)malloc(sizeof(double)*(N+OFFSET));
>     b=(double *)malloc(sizeof(double)*(N+OFFSET));
>     c=(double *)malloc(sizeof(double)*(N+OFFSET));
>
283c291,293
<
---
>     free(a);
>     free(b);
>     free(c);
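
One variation on the plain malloc calls in the diff above that may be worth
trying is a checked, cache-line-aligned allocation. This is a sketch only:
alloc_stream_array and is_aligned64 are hypothetical names, the 64-byte
alignment is an assumption based on the Nehalem cache-line size, and whether
alignment changes the dynamic-array numbers at all is untested here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of stream.c.
 * Allocates n doubles on a 64-byte boundary.
 * Returns NULL on failure; release the result with free(). */
double *alloc_stream_array(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0) {
        return NULL;
    }
    return (double *)p;
}

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

In stream-malloc.c this would replace each malloc line, for example
a = alloc_stream_array(N+OFFSET); followed by a NULL check before the timing
loops start. The free() calls at the end stay as they are.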