[Beowulf] Theoretical vs. Actual Performance
David Mathog
mathog at caltech.edu
Thu Feb 22 09:52:20 PST 2018
On Thu, 22 Feb 2018 09:37:54 -0500 Prentice Bisbal wrote:
> I found literature from AMD stating the
> theoretical performance of these processors is 282 GFLOPS, and my
> LINPACK performance isn't coming close to that (I get approximately
> ~33% of that).
That does seem low. Check the usual culprits:
1. CPU frequency scaling locked to the lowest setting, or set to a mode
which adjusts the frequency and then interacts poorly with the test
software. The rated performance will almost certainly have been
measured with the CPU locked to its highest frequency.
2. Something else running, especially something which forces the test
program out of memory or file caches. I wouldn't expect this sort of
test to be IO bound to disk, but if it is, and hugepages are used,
enormous performance drops may be observed when the system decides to
move those around. I wouldn't put it past AMD or Intel to run these
sorts of tests with the test system stripped down to the bones. No
network, no logging, single user, etc. That is, absolutely nothing that
would compete for CPU time. (Just checked on one of our big systems.
ps -ef | wc shows 953 processes: 48 migration, 48 ksoftirqd, 49
stopper, 49 watchdog, 49 kintegrityd, 49 kblockd, 49 ata_sff, 49 md, 49
md_misc, 49 aio, 49 crypto, 49 kthrotld, 49 rpciod, 19 gdm (console
processes, even with no display attached at the moment and nobody logged
in there), 193 events, 12 of my processes, and 107 miscellaneous OS
processes.)
3. ulimit settings. /etc/security/limits.conf settings.
4. NUMA issues. Some multithreaded programs allocate a large block of
memory once, which ends up on one side of a NUMA system, and then start
some or all of their threads on the other side. Threads on the wrong
side will run slower, by a variable amount, than those on the right
side. If this is what is going on, locking all threads to the same side
of the system (if it has just two sides) can speed things up a bit.
That only helps if the program isn't supposed to use all threads.
5. Different compiler/optimization. The vendor may have used a binary
which was tweaked to the Nth degree, perhaps even using profiling from
earlier runs to optimize the final run. If you are using a benchmark
number from AMD, see if you can obtain the exact same version of the
test software that they used (it may be available), so that you can
eliminate this variable. Wherever they keep it, they may also have a
detailed description of the test system.
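
Several of the items above (1, 3 and 4) can be checked quickly from a
script. What follows is only a rough sketch, assuming a Linux system
that exposes cpufreq and NUMA information under /sys/devices/system.
The function names are made up for illustration, and which resource
limits actually matter for a given LINPACK build is a guess, not
something taken from AMD's documentation.

```python
import glob
import os
import resource


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its cpufreq governor."""
    governors = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as fh:
            governors[cpu] = fh.read().strip()
    return governors


def count_numa_nodes(node_root="/sys/devices/system/node"):
    """Count nodeN directories; 0 means no NUMA info is exposed at this path."""
    return len(glob.glob(os.path.join(node_root, "node[0-9]*")))


def resource_limits():
    """Return (soft, hard) limits for this process; RLIM_INFINITY means unlimited.

    The choice of limits to report is an assumption for illustration.
    """
    names = ("RLIMIT_MEMLOCK", "RLIMIT_STACK", "RLIMIT_AS")
    return {name: resource.getrlimit(getattr(resource, name)) for name in names}


if __name__ == "__main__":
    govs = read_governors()
    not_perf = sorted(cpu for cpu, gov in govs.items() if gov != "performance")
    print(f"CPUs reporting a governor: {len(govs)}")
    print(f"CPUs not on 'performance': {not_perf}")
    print(f"NUMA nodes: {count_numa_nodes()}")
    for name, (soft, hard) in resource_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

If any CPU shows up as not on 'performance', or more than one NUMA node
is reported, those are the first things to pin down before rerunning
the benchmark. An empty governor list only means cpufreq is not exposed
at that path, not that frequency scaling is off.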
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech