[Beowulf] Strange Opteron 2350 performance: Gaussian-03

Sat Jun 28 15:30:54 PDT 2008

In message from Joe Landman <landman at scalableinformatics.com> (Sat, 28 
Jun 2008 14:48:02 -0400):
>   This is possible, depending upon the compiler used.  Though I have 
>to admit that I find it odd that it would be the case within the 
>Opteron family and not between Opteron and Xeon.
>
>   Intel compilers used to (haven't checked 10.1) switch between fast 
>(SSE*) and slow (x87 FP) paths as a function of a processor version 
>string.  If this code was built with an old Intel compiler, it is 
>possible that the code paths are different, though as noted, I 
>would find that surprising if this were the case within the Opteron 
>family.

Well, I thought about the (absence of) use of SSE in the binary 
Gaussian 03 Rev. C02 version I used. But even if x87 code really was 
generated by pgf77, why does this x87-based code give such "high" 
performance on the Opteron 246 in comparison with an Opteron 2350 
core? I ran the same Gaussian binaries on both CPUs!
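
The two dispatch strategies mentioned in the quoted text (choosing a 
code path from a processor version string versus testing capabilities) 
can be contrasted in a minimal sketch. The function names and the 
sample text below are hypothetical and only illustrate the idea; this 
is not how the Intel or PGI runtimes are actually implemented. The 
input is assumed to be text in the Linux /proc/cpuinfo format.

```python
def _first_field(cpuinfo_text, name):
    """Return the value of the first 'name : value' line, or ''."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return value.strip()
    return ""


def path_by_vendor(cpuinfo_text):
    # Hypothetical vendor-string dispatch: any CPU that does not
    # report "GenuineIntel" is sent to the slow x87 path.
    vendor = _first_field(cpuinfo_text, "vendor_id")
    return "sse2" if vendor == "GenuineIntel" else "x87"


def path_by_capability(cpuinfo_text):
    # Hypothetical capability test: use SSE2 whenever the CPU
    # advertises the sse2 feature flag, whatever the vendor is.
    flags = set(_first_field(cpuinfo_text, "flags").split())
    return "sse2" if "sse2" in flags else "x87"


# Made-up, abbreviated sample in /proc/cpuinfo format (assumption).
OPTERON_SAMPLE = "vendor_id\t: AuthenticAMD\nflags\t\t: fpu sse sse2\n"

print(path_by_vendor(OPTERON_SAMPLE))      # vendor-based choice
print(path_by_capability(OPTERON_SAMPLE))  # capability-based choice
```

On a Linux system the same functions could be fed 
open("/proc/cpuinfo").read(). Either strategy would treat two Opterons 
that report the same vendor string and the same flags identically, 
which fits the doubt expressed above that such dispatch explains a 
difference within the Opteron family.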

>   Modern PGI compilers (suggested default for Gaussian-03 last I 
>checked) have the ability to do this as well, though I don't know how 
>they implement it (capability testing hopefully?)
>
>   Out of curiosity, how does STREAM run on both systems? 

I ran STREAM on Opteron 242 and 244 a few years ago. The scalability 
and the throughput itself were OK. Recently I ran STREAM on my Opteron 
2350-based dual-socket server. As expected from the faster DDR2-667 
memory, I obtained higher throughput. In particular, I reproduced the 
8-core result presented in McCalpin's table (sent from AMD), and some 
data presented earlier on our Beowulf mailing list. 

(BTW, there is one bad thing about STREAM on this server, for which 
the corresponding data are absent from McCalpin's table: the 
throughput scales well from 1 to 2 OpenMP threads and gives a good 
result for 8 threads, but the throughput for 4 threads is about the 
same as for 2 threads. The reason is, IMHO, that for 8 threads the 
kernel allocates RAM on both NUMA nodes, but for 4 threads the 
allocated RAM is placed on one node, and the 4 threads compete badly 
for memory access.)
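
The scaling pattern described above is what a simple saturation model 
predicts. The sketch below is only a toy model: the per-thread and 
per-node bandwidth figures are made-up assumptions, not measurements 
from this server, and the function name is hypothetical.

```python
def stream_throughput(threads, nodes_used,
                      per_thread_gbs=3.0, per_node_gbs=6.0):
    """Toy model of aggregate STREAM bandwidth in GB/s.

    Bandwidth grows with the number of threads until the memory
    controllers of the NUMA nodes that hold the data saturate.
    Both default rates are illustrative assumptions.
    """
    return min(threads * per_thread_gbs, nodes_used * per_node_gbs)


# (threads, NUMA nodes holding the arrays), following the guess above:
# up to 4 threads the data sits on one node, with 8 threads on both.
for threads, nodes in [(1, 1), (2, 1), (4, 1), (8, 2)]:
    print(threads, nodes, stream_throughput(threads, nodes))
```

With these assumed rates, 2 threads already saturate a single node, so 
4 threads on the same node gain nothing, while 8 threads spread over 
both nodes double the 2-thread figure. If the explanation is right, 
forcing the 4-thread run to place memory on both nodes should remove 
the plateau.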

Note that Gaussian-03 was slow on an Opteron 2350 core in a 
sequential run, where the Opteron 2350's RAM can only be an advantage 
in comparison with the Opteron 246. I didn't run STREAM on the 
Opteron 246, but this is clear to me.

>   Also, it is possible, with a larger cache, that you might be 
>running into some odd cache effects (tlb/page thrashing).  But DFTs 
>are usually "small" and thus "sensitive" to cache size.
>
>   You might be able to instrument the run within a papi wrapper, and 
>see if you observe a large number of cache/tlb flushes for some 
>reason.
>
>   On a related note:  are you using a stepping before B3 of 2350? 
>That could impact performance, if you have the patch in place or have 
>the tlb/cache turned off in bios (some MB makers created a patch to 
>do this).

Gaussian-03 fails in link302 on Barcelona B2 because of this error. I 
use stepping B3. 
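
On Linux the stepping can be read from /proc/cpuinfo. The sketch below 
parses text in that format; the helper names and the sample text are 
hypothetical, and the mapping of family 16, model 2, stepping 2 or 3 
to revisions B2 and B3 is an assumption that should be checked 
against AMD's revision guide.

```python
def cpu_signature(cpuinfo_text):
    """Return (family, model, stepping) of the first listed CPU."""
    wanted = ("cpu family", "model", "stepping")
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Exact key match, so the "model name" line is ignored.
        if sep and key in wanted and key not in fields:
            fields[key] = int(value.strip())
    return tuple(fields.get(name) for name in wanted)


def barcelona_revision(cpuinfo_text):
    family, model, stepping = cpu_signature(cpuinfo_text)
    # Assumed mapping: family 16 (10h), model 2 is Barcelona;
    # stepping 2 is revision B2 and stepping 3 is revision B3.
    if family == 16 and model == 2:
        return {2: "B2", 3: "B3"}.get(stepping, "unknown")
    return "not Barcelona"


# Made-up, abbreviated sample in /proc/cpuinfo format (assumption).
SAMPLE = (
    "cpu family\t: 16\n"
    "model\t\t: 2\n"
    "model name\t: Quad-Core AMD Opteron(tm) Processor 2350\n"
    "stepping\t: 3\n"
)
print(barcelona_revision(SAMPLE))
```

On a real system the input would be open("/proc/cpuinfo").read().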

Yours
Mikhail

>
>Joe
>
>
>-- 
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
>phone: +1 734 786 8423
>fax  : +1 866 888 3112
>cell : +1 734 612 4615