[Beowulf] Poor bandwidth from one compute node

Thu Aug 17 09:35:47 PDT 2017

On 08/17/2017 12:00 PM, Faraz Hussain wrote:
> I noticed an MPI job was taking 5X longer to run whenever it got the 
> compute node lusytp104. So I ran qperf and found the bandwidth 
> between it and any other node was ~100MB/sec. This is much lower than 
> the ~1GB/sec between all the other nodes. Any tips on how to debug 
> further? I haven't tried rebooting since it is currently running a 
> single-node job.
>
> [hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
> tcp_lat:
>     latency  =  17.4 us
> tcp_bw:
>     bw  =  118 MB/sec
> [hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
> tcp_lat:
>     latency  =  20.4 us
> tcp_bw:
>     bw  =  1.07 GB/sec
>
> This is a separate issue from my previous post about a slow compute 
> node. I am still investigating that per the helpful replies. Will post 
> an update about that once I find the root cause!

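Sounds very much like it is running over gigabit Ethernet rather than 
InfiniBand: 118 MB/sec is about what a 1 Gbit/s link tops out at 
(roughly 125 MB/sec before protocol overhead). Check to make sure 
lusytp104 is using the right network ...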
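
The arithmetic behind that guess can be sketched in a few lines of Python. 
This is only an illustration, not a diagnostic tool: the function names are 
made up for this example, and it assumes qperf reports decimal units 
(1 MB = 10^6 bytes). It converts the two bw lines from the output above to 
Gbit/s and checks whether each would fit within a 1 Gbit/s link.

```python
import re

# Nominal line rate of gigabit Ethernet, in Gbit/s.
GIGABIT_ETHERNET_GBPS = 1.0

# Assumption: qperf prints decimal units (1 MB = 10^6 bytes).
UNIT_TO_BYTES = {"KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_qperf_bw(line):
    """Convert a qperf line such as 'bw  =  118 MB/sec' to Gbit/s."""
    match = re.search(r"bw\s*=\s*([\d.]+)\s*(KB|MB|GB)/sec", line)
    if match is None:
        raise ValueError(f"not a qperf bandwidth line: {line!r}")
    value = float(match.group(1))
    unit = match.group(2)
    return value * UNIT_TO_BYTES[unit] * 8 / 1e9


def fits_gigabit_ethernet(gbps):
    """True if the measured rate could have been carried by a 1 Gbit/s link."""
    return gbps <= GIGABIT_ETHERNET_GBPS


# The two measurements quoted in this thread.
samples = [
    ("lusytp104", "bw  =  118 MB/sec"),
    ("lusytp113", "bw  =  1.07 GB/sec"),
]

for host, line in samples:
    gbps = parse_qperf_bw(line)
    verdict = "fits in 1 GbE" if fits_gigabit_ethernet(gbps) else "faster than 1 GbE"
    print(f"{host}: {gbps:.2f} Gbit/s ({verdict})")
```

A rate just under 1 Gbit/s is consistent with traffic going over a gigabit 
Ethernet interface, but it does not prove it; the interface and routing on 
the node itself still need to be checked.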


-- 
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman