<div dir="ltr"><div><div><br></div>Hi Joe,<br><br></div><div><div><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">
Median changes by more than a factor of 2. And the distribution tail is *huge*.<br></div>
FWIW: 6.2 was a terrible release. If you have to use pure RHEL, get to 6.5+. And there are many tunables you need to look at.<br></blockquote><div><br></div><div> Thanks for your reply - I may ask our IT squad to put 6.5 on a set of nodes for testing, but playing with the tunables is probably the first step. I don't have root access, so I can't change settings myself, but a few of the power options (e.g., /sys/module/pcie_aspm/parameters/policy) already look like good candidates to change: that one is currently in a 'power save' state on the poorly performing nodes, whereas the file doesn't even exist on the 5.5 nodes. <br>
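<div> A rough sketch of how that setting could be compared across nodes without root. It assumes the file follows the usual sysfs convention of listing every choice and putting square brackets around the active one; the function names are hypothetical, and a missing file (as on the 5.5 nodes) is reported as None.<br></div>
<pre>
```python
import re
from pathlib import Path

# Path taken from the message; absent on kernels that do not expose it.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) entry of the policy text, or None."""
    # Assumed format: all choices on one line, active one in [brackets].
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("active ASPM policy:", read_aspm_policy())
```
</pre>
<div> Running this on one good node and one bad node gives a quick side-by-side of the active policy before asking anyone with root to change it.<br></div>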
</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Bigger view ... have you isolated a CPU for IB handling, so at 7 cores, your machine is full (1 for IB and 7 for apps), but at 8 cores you are contending for resources (8 for apps + 1 for IB)?<br>
Are you running the app with taskset (explicitly or implicitly)?<span class="HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div> In the test we're running, there isn't really any local processing outside of the communication - each task, bound to its own core, simply sends messages in a giant loop. While there are clearly 8 cores all talking to 1 IB device, each one (I believe) mmaps its own range and handles its own message processing. Furthermore, this definitely worked before, so it doesn't seem like a resource contention issue, unless it's something to do with mmap on the versions we're running. I did double-check that processes aren't migrating between cores, though. <br>
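<div> One way to run that kind of migration check, as a minimal sketch rather than the exact method used here: sample field 39 of /proc/PID/stat, which holds the CPU the task last ran on. The function names and sampling parameters are hypothetical, and only the main thread of the process is inspected.<br></div>
<pre>
```python
import time


def last_cpu(stat_line):
    """Return the CPU a task last ran on (field 39 of /proc/PID/stat)."""
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split after the last ')' and count from field 3 onward.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])


def cpus_seen(pid, samples=20, interval=0.05):
    """Sample a process repeatedly; return the set of CPUs it was seen on."""
    seen = set()
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as handle:
            seen.add(last_cpu(handle.read()))
        time.sleep(interval)
    return seen
```
</pre>
<div> For a task that stays pinned, cpus_seen(pid) should come back with a single CPU number; more than one entry would mean the task moved during the sampling window.<br></div>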
<br></div><div> Mostly, I'm poking around kernel tunables right now and making a list of things that might point to the issue. I'll also take a closer look at /proc/interrupts during a run soon.<br><br></div>
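<div> A sketch of how the /proc/interrupts comparison could go: take one snapshot before the run and one after, then list only the IRQs whose per-CPU counts moved. It assumes the usual layout (a header row of CPU names, then one row per IRQ with counts followed by a description); the function names are hypothetical.<br></div>
<pre>
```python
def parse_interrupts(text):
    """Map each IRQ label to its list of per-CPU counts."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for token in rest.split()[:ncpus]:
            if not token.isdigit():
                break  # reached the controller/device description
            counts.append(int(token))
        table[label.strip()] = counts
    return table


def interrupt_deltas(before, after):
    """Per-CPU count increase for every IRQ whose counts changed."""
    deltas = {}
    for label, new in after.items():
        old = before.get(label, [0] * len(new))
        diff = [n - o for n, o in zip(new, old)]
        if any(diff):
            deltas[label] = diff
    return deltas
```
</pre>
<div> Feeding it the file contents read before and after a run shows which cores are actually servicing the interrupts, which can then be compared between the good and bad nodes.<br></div>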
<div> Thanks again,<br></div><div> - Brian<br></div><div> <br></div></div></div></div></div></div>