[Beowulf] mpi alltoall help

Michael Di Domenico mdidomenico4 at gmail.com
Tue Oct 10 08:58:59 PDT 2017


i posted a copy of this to openmpi mailing list, but i'm curious if
anyone here can lend suggestions on troubleshooting

---

i'm getting stuck trying to run some fairly large IMB-MPI alltoall
tests under openmpi 2.0.2 on rhel 7.4

i have two different clusters, one running mellanox fdr10 and one
running qlogic qdr

if i issue

mpirun -n 1024 ./IMB-MPI1 -npmin 1024 -iter 1 -mem 2.001 alltoallv

the job just stalls after IMB-MPI prints the "List of Benchmarks to
run: Alltoallv" line

if i switch it to alltoall the test does progress
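
for a sense of scale, here is a rough sketch of the buffer arithmetic
behind that command line. it assumes each rank holds one send and one
receive buffer with one message-sized slot per rank, and that -mem is
counted in GiB; neither assumption is checked against IMB's own
accounting, and the function names are made up for illustration

```python
# rough per-rank buffer estimate for an alltoall-style exchange.
# assumption: one send buffer plus one receive buffer per rank, each
# holding one message-sized slot for every rank in the job.
# this is a sketch, not IMB's exact memory accounting.

def alltoall_buffer_bytes(nranks, msg_bytes):
    """bytes one rank needs for its send + receive buffers."""
    return 2 * nranks * msg_bytes


def largest_pow2_msg(nranks, budget_bytes):
    """largest power-of-two message size whose buffers fit the budget.

    starts at 1 byte and doubles while the next size still fits.
    """
    msg = 1
    while alltoall_buffer_bytes(nranks, msg * 2) <= budget_bytes:
        msg *= 2
    return msg


nranks = 1024                    # from -n 1024 / -npmin 1024
budget = int(2.001 * 1024**3)    # from -mem 2.001, assuming GiB
msg = largest_pow2_msg(nranks, budget)
print(msg, alltoall_buffer_bytes(nranks, msg))
```

under those assumptions the budget admits messages up to 1 MiB, which
is 2 GiB of buffer space per rank at 1024 ranks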

often when running alltoalls of various sizes i'll get

"too many retries sending message to <>:<>, giving up"

i'm able to use infiniband just fine (our lustre filesystem mounts
over it) and i have other mpi programs running

the problem only seems to show up when i run alltoall type primitives

any thoughts on how to debug where the failures are? i might just need
to turn up the debugging output, but i'm not sure where to do that

