<div dir="ltr"><div>Faraz,</div><div>I strongly suggest you look at Intel Cluster Checker (ICC).</div><div>I guess you cannot take down a production cluster for a full Cluster Checker run, but these are exactly the types of faults ICC is designed to find. You could define a small set of compute nodes, including this one, and run ICC on just those.</div><div><br></div><div>As for the diagnosis, run ethtool &lt;interface name&gt;, where &lt;interface name&gt; is the name of your Ethernet interface.</div><div>Compare that with the output of ethtool on a properly working compute node. In particular, check the Speed and Duplex lines: 118 MB/sec is roughly what a 1 Gb/s link delivers, so the port on lusytp104 may have negotiated a lower speed than the others.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 17 August 2017 at 18:00, Faraz Hussain <span dir="ltr">&lt;<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I noticed an MPI job was taking 5X longer to run whenever it got the compute node lusytp104. So I ran qperf and found the bandwidth between it and any other node was ~100 MB/sec. This is much lower than the ~1 GB/sec between all the other nodes. Any tips on how to debug further? I haven't tried rebooting since it is currently running a single-node job.<br>
<br>
[hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw<br>
tcp_lat:<br>
latency = 17.4 us<br>
tcp_bw:<br>
bw = 118 MB/sec<br>
[hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw<br>
tcp_lat:<br>
latency = 20.4 us<br>
tcp_bw:<br>
bw = 1.07 GB/sec<br>
<br>
This is a separate issue from my previous post about a slow compute node. I am still investigating that per the helpful replies. Will post an update about that once I find the root cause!<br>
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
</blockquote></div><br></div>