[Beowulf] Good IB network performance when using 7 cores, poor performance on all 8?

Brian Dobbins bdobbins at gmail.com
Thu Apr 24 08:31:56 PDT 2014


Hi everyone,

  We're having a problem with one of our clusters after it was upgraded to
RH6.2 (from CentOS 5.5): the performance of our InfiniBand network degrades
randomly and severely when we use all 8 cores in our nodes for MPI, but
not when we use only 7 cores per node.

  For example, I have a hacked-together script (below) that runs a sequence
of 20 sets of fifty MPI_Allreduce tests via the Intel MPI Benchmarks (IMB),
and then calculates statistics on the average times per individual set.  For
our 'good' (CentOS 5.5) nodes, we see consistent results:

% perftest hosts_c20_8c.txt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  176.0   177.3   182.6   182.8   186.1   196.9
% perftest hosts_c20_8c.txt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  176.3   180.4   184.8   187.0   189.1   213.5

  ... But for our tests on the RH6.2 install, we see enormous variance:

% perftest hosts_c18_8c.txt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  176.8   185.9   217.0   347.6   387.7  1242.0
% perftest hosts_c18_8c.txt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  178.2   204.5   390.5   329.6   409.4   493.1

  Note that the minimums are similar -- not *every* run experiences this
jitter -- and in the case of the first run of the script, even the median
value is pretty decent, so seemingly only a few of the tests were slow.
But the maximum is enormous.  Each of these tests is run one right after
the other, and strangely the variation always shows up between *instances*
of the IMB code, not within the individual loops -- e.g., one of the fifty
iterations inside an individual call.  Those all seem consistent, so it's
either luck, some issue with mapping the IB device, some interrupt issue in
the kernel, etc.
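
  To poke at that interrupt theory, something along these lines should show
whether the IB interrupts land on the same cores as the MPI ranks (an
untested sketch -- it assumes the QLogic qib driver labels its interrupts
'qib' in /proc/interrupts, so adjust the pattern to match your nodes):

# Which CPUs have been servicing the IB interrupts so far:
grep -i qib /proc/interrupts
# And which CPUs each of those IRQs is allowed to run on:
for irq in $(awk -F: '/qib/ {print $1}' /proc/interrupts); do
  echo -n "IRQ ${irq}: "
  cat /proc/irq/${irq}/smp_affinity_list 2>/dev/null || \
    cat /proc/irq/${irq}/smp_affinity
done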

  If I then run the exact same test but with only 7 cores per node, the
problem vanishes:

% perftest hosts_c18_7c.txt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  186.7   192.6   197.0   198.5   199.6   226.2

  The IB devices are QLogic IBA7322 cards and all processes are bound to
unique cores.  We've run with Open MPI 1.6.4 and 1.8.0, and I also tested
MVAPICH2, all with the same results, so this isn't specific to the MPI
flavor.  The only difference between the good and bad nodes appears to be
the host OS install (including OFED differences).  Our IT guys are playing
with some options there, but if anyone has any sage advice, I'm all ears.
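
  (In case it helps, the binding can be double-checked with something along
these lines -- a rough sketch: --report-bindings is the Open MPI option, and
the pgrep/taskset part assumes the IMB binary is named IMB-MPI1 as in the
script at the end.)

# Ask Open MPI to report where each rank lands (output goes to stderr):
mpiexec -n 8 --machinefile hosts_c18_8c.txt --report-bindings \
        --bind-to-socket hostname 2>&1 | sort

# Or, while the benchmark runs on a node, see what the kernel is actually
# enforcing for each IMB process:
for pid in $(pgrep -f IMB-MPI1); do taskset -cp ${pid}; done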

  Many thanks,
  - Brian

---
Here's the little hacked-together script I'm using - the 'lengths' file is
just a text file with a line that says '65536':
#!/bin/bash
# Quick test for bad MPI performance.  Called with: perftest <hostfile>

# Parameters:
NUMTESTS=20
MINPROC=64
IMB_EXE=~/test/IMB-MPI1
LENGTHS_FILE=~/test/lengths_file.txt

if [ "$#" -ne 1 ]; then
    echo "Usage: perftest <hostfile>"
    exit 1
fi
HOSTS=$1

# Main script:
for n in `seq 1 $NUMTESTS`; do
  mpiexec -n $MINPROC --machinefile $HOSTS --bind-to-socket ${IMB_EXE} \
          Allreduce -npmin $MINPROC -multi 1 -msglen ${LENGTHS_FILE} \
          -iter 50,50 -time 5.0
done | grep "655" | awk '{print $6}' \
     | Rscript -e 'summary(as.numeric(readLines("stdin")))'
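
  For completeness, the lengths file referenced above is just that one
message size, e.g.:

echo 65536 > ~/test/lengths_file.txt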