[Beowulf] Strange Opteron 2350 performance: Gaussian-03
Mikhail Kuzminsky
kus at free.net
Sat Jun 28 15:30:54 PDT 2008
In message from Joe Landman <landman at scalableinformatics.com> (Sat, 28
Jun 2008 14:48:02 -0400):
>
> This is possible, depending upon the compiler used. Though I have to
>admit that I find it odd that it would be the case within the Opteron
>family and not between Opteron and Xeon.
>
> Intel compilers used to (haven't checked 10.1) switch between fast
>(SSE*) and slow (x87 FP) paths as a function of a processor version
>string. If this is an old Intel compiler built code, this is
>possible that the code paths may be different, though as noted, I
>would find that surprising if this were the case within the Opteron
>family.
Well, I thought about the (absence of) use of SSE in the binary
Gaussian 03 Rev. C02 version I used. But even if x87 code really was
generated by pgf77, why does this x87-based code give such "high"
performance on the Opteron 246 in comparison with an Opteron 2350
core? I ran the same Gaussian binaries on both CPUs!
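
To illustrate the difference between the two dispatch styles discussed
above (choosing a code path from the processor identification string
versus testing for the capability itself), here is a minimal Python
sketch. It is not how the Intel or PGI runtimes actually implement
dispatch, which I do not know. The function names and the abbreviated
SAMPLE text are hypothetical; only the "vendor_id" and "flags" field
names follow the Linux /proc/cpuinfo layout.

```python
# Hypothetical, abbreviated /proc/cpuinfo-style excerpt for an AMD CPU.
SAMPLE = """vendor_id : AuthenticAMD
flags : fpu sse sse2 pni
"""


def field(cpuinfo_text, name):
    """Return the value of the first 'name : value' line, or ''."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == name:
            return value.strip()
    return ""


def pick_path_by_capability(cpuinfo_text):
    """Capability test: use SSE2 whenever the CPU reports the flag."""
    return "sse2" if "sse2" in field(cpuinfo_text, "flags").split() else "x87"


def pick_path_by_vendor(cpuinfo_text):
    """Vendor-string test: use SSE2 only on one vendor (illustrative)."""
    return "sse2" if field(cpuinfo_text, "vendor_id") == "GenuineIntel" else "x87"


if __name__ == "__main__":
    print("capability:", pick_path_by_capability(SAMPLE))
    print("vendor    :", pick_path_by_vendor(SAMPLE))
```

With a vendor-string test the same binary can take different paths on
CPUs with identical capabilities, but it would not by itself explain a
difference between two Opterons, since both report the same vendor.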
> Modern PGI compilers (suggested default for Gaussian-03 last I
>checked) have the ability to do this as well, though I don't know how
>they implement it (capability testing hopefully?)
>
> Out of curiosity, how does streams run on both systems?
I ran stream on Opteron 242 and 244 a few years ago. The scalability
and the throughput itself were OK. Now I have run stream on my Opteron
2350-based dual-socket server. In line with its faster DDR2-667
memory, I obtained higher throughput. In particular, I reproduced the
8-core result presented in McCalpin's table (sent from AMD), and some
data presented earlier on our Beowulf mailing list.
(BTW, there is one bad thing about stream on this server, for which
the corresponding data are absent from McCalpin's table: the
throughput scales well from 1 to 2 OpenMP threads and gives a good
result for 8 threads, but the throughput for 4 threads is about the
same as for 2 threads. The reason is, IMHO, that for 8 threads the RAM
is allocated by the kernel on both nodes, but for 4 threads the
allocated RAM is placed on one node, and the 4 threads compete badly
for memory access.)
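
The explanation above can be sketched as a toy model in Python. This
is only an illustration of the reasoning, not a measurement: the
function name and both bandwidth figures (3.0 GB/s demanded per thread,
6.0 GB/s available per NUMA node) are made-up assumptions chosen so
that two threads saturate one node.

```python
def modeled_bandwidth(threads, nodes_with_memory,
                      per_thread_gbs=3.0, per_node_gbs=6.0):
    """Toy model: aggregate throughput is the threads' total demand,
    capped by the memory bandwidth of the NUMA nodes that actually
    hold the arrays. All figures are hypothetical."""
    demand = threads * per_thread_gbs
    cap = nodes_with_memory * per_node_gbs
    return min(demand, cap)


if __name__ == "__main__":
    # (threads, nodes holding the data): 4 threads with memory on one
    # node, 8 threads with memory spread over both nodes.
    for threads, nodes in [(1, 1), (2, 1), (4, 1), (8, 2)]:
        print(threads, "threads,", nodes, "node(s):",
              modeled_bandwidth(threads, nodes), "GB/s")
```

In this model the 4-thread case gives the same figure as the 2-thread
case because one node's memory is already saturated, while the
8-thread case doubles it because both nodes serve data.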
Keep in mind that Gaussian-03 was slow on the Opteron 2350 core in a
sequential run, where the Opteron 2350's RAM gives it only advantages
in comparison with the Opteron 246. I didn't run stream on the Opteron
246, but that much is clear to me.
> Also, it is possible, with a larger cache, that you might be running
>into some odd cache effects (tlb/page thrashing). But DFTs are
>usually "small" and thus "sensitive" to cache size.
>
> You might be able to instrument the run within a papi wrapper, and
>see if you observe a large number of cache/tlb flushes for some
>reason.
>
> On a related note: are you using a stepping before B3 of 2350? That
>could impact performance, if you have the patch in place or have the
>tlb/cache turned off in bios (some MB makers created a patch to do
>this).
Gaussian-03 fails in link302 on Barcelona B2 because of this error. I
use stepping B3.
Yours
Mikhail
>
>Joe
>
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
>phone: +1 734 786 8423
>fax : +1 866 888 3112
>cell : +1 734 612 4615