[Beowulf] Weird blade performs worse as more cpus are used?
joe.landman at gmail.com
Thu Sep 14 08:42:44 PDT 2017
On 09/14/2017 11:34 AM, Faraz Hussain wrote:
> Earlier I had posted about one of our blades running 30-50% slower
> than other ones despite having identical hardware and OS. I followed
> the suggestions and compared cpu temperature, memory, dmesg and
> sysctl. Everything looks the same.
> I then used "perf stat" to compare speed of pigz (parallel gzip).
> The results are quite interesting. Using one cpu, the slow blade is as
> fast as the rest! But as I use more cpus, the speed decreases linearly
> from 3.1 GHz to 0.4 GHz. See snippets from "perf stat" command below.
> All tests were on /tmp to eliminate any nfs issue. And same behavior
> is observed with any multi-threaded program.
What does numastat report? Is /tmp a ramdisk or tmpfs? Are the
nodes/cpus otherwise idle? What does lscpu report on a good node versus
the bad one?
If it decreases on a 1/Ncpu curve, then you are fighting contention for
a fixed-size resource (some bandwidth limit shared by all the threads).
The question is which resource.
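
One rough way to check the 1/Ncpu idea is to fit the per-thread rate
against c/N and see how far off the fit is. The sketch below is only an
illustration: the function names, the 15% tolerance, and the sample
numbers are all made up for the example and are not taken from the
"perf stat" output in this thread.

```python
def fit_inverse_n(threads, rates):
    """Least-squares fit of rates ~ c / n.

    Returns (c, worst_relative_error) where c is the fitted
    single-thread rate and the error is the largest |fit - rate| / rate.
    """
    xs = [1.0 / n for n in threads]
    c = sum(x * r for x, r in zip(xs, rates)) / sum(x * x for x in xs)
    worst = max(abs(c * x - r) / r for x, r in zip(xs, rates))
    return c, worst


def looks_like_shared_limit(threads, rates, tolerance=0.15):
    """True if the per-thread rate tracks c / n within the tolerance.

    The tolerance is an arbitrary assumption; adjust it to the noise
    level of the actual measurements.
    """
    _, worst = fit_inverse_n(threads, rates)
    return worst <= tolerance


# Hypothetical per-thread clock rates in GHz, NOT real measurements.
threads = [1, 2, 4, 8]
contended = [3.1, 1.55, 0.78, 0.39]   # halves each time threads double
healthy = [3.1, 3.1, 3.0, 3.0]        # roughly flat

if __name__ == "__main__":
    print(looks_like_shared_limit(threads, contended))
    print(looks_like_shared_limit(threads, healthy))
```

If the slow blade's numbers track c/N this closely, the total stays
roughly constant no matter how many threads run, which is what a single
shared limit would look like.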
e: joe.landman at gmail.com