[Beowulf] Poor bandwidth from one compute node

John Hearns hearnsj at googlemail.com
Thu Aug 17 09:28:19 PDT 2017


Faraz,
I really suggest you look at the Intel Cluster Checker.
I expect you cannot take down a production cluster for a full Cluster
Checker run, but these are exactly the kinds of faults ICC is designed to
find. You can define a small set of compute nodes to run on, including
this node, and run ICC on just those (a sketch follows below).
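
A minimal sketch, assuming a recent ICC release where the command-line tool
is called clck and takes a node list with -f (check your installed version's
documentation for the exact invocation):

    # nodes.txt: the suspect node plus a couple of known-good peers
    printf "lusytp104\nlusytp113\nlusytp114\n" > nodes.txt
    # Collect data and analyze in one pass; the summary flags nodes whose
    # network performance is an outlier relative to the rest of the set.
    clck -f nodes.txt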

As for the diagnosis, run    ethtool <interface name>    where
<interface name> is the name of your Ethernet interface, and compare the
output with ethtool on a properly working compute node.
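
For what it's worth, 118 MB/sec is right around gigabit line rate
(~125 MB/sec), so the suspect node may simply have negotiated 1 Gb/s instead
of 10 Gb/s. A minimal sketch, assuming the interface is eth0 (an assumption;
substitute your actual interface name):

    # On lusytp104 -- check negotiated speed, duplex and link state:
    ethtool eth0 | egrep 'Speed|Duplex|Link detected'
    # A healthy 10GbE port should report "Speed: 10000Mb/s" and "Duplex: Full";
    # "Speed: 1000Mb/s" would match the ~118 MB/sec you measured.
    # Also look for errors accumulating on the port:
    ethtool -S eth0 | egrep -i 'err|drop|crc'
    # Run the same two commands on a good node (e.g. lusytp113) and compare.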

On 17 August 2017 at 18:00, Faraz Hussain <info at feacluster.com> wrote:

> I noticed an mpi job was taking 5X longer to run whenever it was assigned
> the compute node lusytp104. So I ran qperf and found the bandwidth between
> it and any other node was ~100 MB/sec. This is much lower than the
> ~1 GB/sec between all the other nodes. Any tips on how to debug further?
> I haven't tried rebooting since it is currently running a single-node job.
>
> [hussaif1 at lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
> tcp_lat:
>     latency  =  17.4 us
> tcp_bw:
>     bw  =  118 MB/sec
> [hussaif1 at lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
> tcp_lat:
>     latency  =  20.4 us
> tcp_bw:
>     bw  =  1.07 GB/sec
>
> This is a separate issue from my previous post about a slow compute node.
> I am still investigating that per the helpful replies. Will post an update
> about that once I find the root cause!
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>