[Beowulf] AMD64 results...

Thu Dec 16 07:16:45 PST 2004

All,

Here are the data again comparing gcc, PGI, and the Pathscale compilers
on our cluster and Bill's Opteron, with prefetching turned on in PGI and
gcc as well.  Our system has the same clock as Bill's, 2.2 GHz, but
slower memory (PC2700).  I have thrown in some X1 SSP timings as well.
The numbers demonstrate the importance of explicitly asking for
prefetching on the non-Pathscale compilers.  Pathscale still comes out
on top (at about half the X1 SSP rate) here, but the numbers are now
much closer, and the remaining differences may be partly accounted for
by the faster memory in Bill's system (PC3200 versus PC2700 for our
system).

I include the X1 single SSP data as well.  Of course, if you are focused
on raw bandwidth, you should get numbers both with and without
prefetching; otherwise you are silently including cache effects.
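
For readers who have not seen what "asking for prefetching" amounts to,
here is a rough sketch of a STREAM-style triad loop with explicit
software prefetch hints.  It uses GCC's __builtin_prefetch; the function
name triad_prefetch and the PREFETCH_AHEAD distance of 64 elements are
made up for illustration and would need tuning on any real machine.
This is not the code that -fprefetch-loop-arrays or the PGI flags
actually generate, only the general idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements (an assumption). */
#define PREFETCH_AHEAD 64

/* STREAM-style triad, a[i] = b[i] + q * c[i], with prefetch hints
 * issued for the two input streams a fixed distance ahead. */
void triad_prefetch(double *a, const double *b, const double *c,
                    double q, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Only hint addresses that lie inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0); /* read */
            __builtin_prefetch(&c[i + PREFETCH_AHEAD], 0, 0); /* read */
        }
#endif
        a[i] = b[i] + q * c[i];
    }
}
```

Timing this loop with and without the hints (or with and without the
compiler flag) is one way to get the two sets of numbers suggested
above.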

The equivalent *one processor* megaflop ratings for the triad data
below are:

gcc (noprefetch):       186 MFLOPs
gcc (prefetch):         279 MFLOPs
pgcc (prefetch):        300 MFLOPs
pscalecc (prefetch):    347 MFLOPs
x1cc (vector, 1ssp):    780 MFLOPs
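
For anyone wanting to check the conversion: these ratings appear
consistent with the usual STREAM accounting for triad on doubles, i.e.
24 bytes moved and 2 flops per loop iteration, so MFLOPs = (MB/s) / 12,
truncated to a whole number.  A minimal sketch in C follows; the name
triad_mflops is made up for illustration, and the accounting is an
assumption rather than something stated in this post.

```c
#include <assert.h>

/* STREAM triad: a[i] = b[i] + q * c[i] on doubles.
 * Assumed accounting: 3 * 8 = 24 bytes and 2 flops per iteration. */
#define TRIAD_BYTES_PER_ITER 24.0
#define TRIAD_FLOPS_PER_ITER 2.0

/* Convert a triad bandwidth in MB/s to the equivalent MFLOPs rating. */
double triad_mflops(double mb_per_s)
{
    assert(mb_per_s >= 0.0);
    return mb_per_s * TRIAD_FLOPS_PER_ITER / TRIAD_BYTES_PER_ITER;
}
```

For example, triad_mflops(2237.3599) gives roughly 186.4, which
truncates to the 186 MFLOPs listed for gcc without prefetching.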

Dual processor ratings should be close to double these on the Opteron,
so I expect one node (two CPUs) on the Opteron to be almost equal to one
SSP on the X1.

Enjoy and prefetch!

rbw

gcc-3.2.3  -O4 -Wall -pedantic:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        2004.8056       0.0095       0.0080       0.0099
Scale:       2044.7551       0.0099       0.0078       0.0105
Add:         2272.3092       0.0133       0.0106       0.0137
Triad:       2237.3599       0.0134       0.0107       0.0137

gcc-3.2.3  -O4 -fprefetch-loop-arrays -Wall -pedantic:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        3259.9273       0.0049       0.0049       0.0052
Scale:       3294.9803       0.0049       0.0049       0.0049
Add:         3306.7241       0.0073       0.0073       0.0073
Triad:       3349.1914       0.0072       0.0072       0.0072

pgcc  -fast -Mvect=sse -Mnontemporal:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        3227.6291       0.0050       0.0050       0.0052
Scale:       3210.1824       0.0050       0.0050       0.0050
Add:         3571.3935       0.0067       0.0067       0.0068
Triad:       3604.1280       0.0067       0.0067       0.0068

Pathscale-1.4 -O3:
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3764.6831       0.1540       0.1700       0.1800
Scale:       3764.6831       0.1530       0.1700       0.1700
Add:         4173.8781       0.2080       0.2300       0.2400
Triad:       4173.8781       0.2110       0.2300       0.2400

X1cc -c -h inline3,scalar3,vector3 -h stream0:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        7600.2280       0.0022       0.0021       0.0022
Scale:       7600.5529       0.0024       0.0021       0.0030
Add:         9259.1164       0.0026       0.0026       0.0027
Triad:       9360.5935       0.0026       0.0026       0.0026

Greg Lindahl wrote:

>On Wed, Dec 15, 2004 at 06:29:56PM -0800, Bill Broadley wrote:
>
>>Kudos for the pathscale-1.4 compiler with -O3.
>
>Thank you! The not-so-secret secret is to use non-temporal stores,
>which we do automagically where needed with plain -O3.
>
>-- greg