[Beowulf] Poor bandwith from one compute node
    Faraz Hussain 
    info at feacluster.com
       
    Thu Aug 17 09:00:27 PDT 2017
    
    
  
I noticed an mpi job was taking 5X longer to run whenever it got the  
compute node lusytp104 . So I ran qperf and found the bandwidth  
between it and any other nodes was ~100MB/sec. This is much lower than  
~1GB/sec between all the other nodes. Any tips on how to debug  
further? I haven't tried rebooting since it is currently running a  
single-node job.
[hussaif1 at lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
tcp_lat:
     latency  =  17.4 us
tcp_bw:
     bw  =  118 MB/sec
[hussaif1 at lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
tcp_lat:
     latency  =  20.4 us
tcp_bw:
     bw  =  1.07 GB/sec
This is separate issue from my previous post about a slow compute  
node. I am still investigating that per the helpful replies. Will post  
an update about that once I find the root cause!
    
    
More information about the Beowulf
mailing list