[Beowulf] IB problem with openmpi 1.2.8
Bill Wichser
bill at Princeton.EDU
Tue Jul 13 13:09:20 PDT 2010
Just some more info. Went back to the prior kernel with no luck.
Updated the firmware on the Topspin HBA cards to the latest (final)
version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not
sure where to look.
Bill Wichser wrote:
> The machine is an older Intel Woodcrest cluster with a two-tiered IB
> infrastructure built on Topspin/Cisco 7000 switches. The core switch is an
> SFS-7008P with a single management module, which runs the subnet manager (SM).
> The cluster runs RHEL4 and was upgraded last week to kernel
> 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much
> stock.
>
> After rebooting, the IB cards in the nodes remained in the INIT
> state. I rebooted the chassis IB switch as it appeared that no SM was
> running. No help. I manually started an opensm on a compute node
> telling it to ignore other masters as initially it would only come up
> in STANDBY. This turned all the nodes' IB ports to active and I
> thought that I was done.
>
> ibdiagnet complained that there were two masters. So I killed the
> opensm and now it was happy. osmtest -f c/osmtest -f a comes back
> with OSMTEST: TEST "All Validations" PASS.
> ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with
> everything coming up roses.
>
> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the
> node count goes over 32 (or maybe 40). This worked fine in the past,
> before the reboot. User apps are failing as well as IMB v3.2. I've
> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout
> 20" which helped for 48 nodes, but when increasing to 64 and 128 it
> didn't help at all. Typical error messages follow.
>
> Right now I am stuck. I'm not sure what or where the problem might
> be. Nor where to go next. If anyone has a clue, I'd appreciate
> hearing it!
>
> Thanks,
> Bill
>
>
> typical error messages
>
> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
> --------------------------------------------------------------------------
>
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
>
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> --------------------------------------------------------------------------
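
For reference, the timeout formula quoted in the help text above can be worked out with a few lines of Python. This is only a sketch of the arithmetic as the help text states it; the function name ack_timeout_us is made up for illustration and is not part of Open MPI.

```python
# Local ACK timeout as described in the Open MPI help text:
#   4.096 microseconds * (2^btl_openib_ib_timeout)
BASE_US = 4.096  # base unit in microseconds, taken from the help text


def ack_timeout_us(exponent: int) -> float:
    """Return the local ACK timeout in microseconds for a given exponent.

    Hypothetical helper for illustration only; not an Open MPI API.
    """
    return BASE_US * (2 ** exponent)


if __name__ == "__main__":
    # 10 is the default named in the help text; 20 is the value tried above.
    for exponent in (10, 20):
        us = ack_timeout_us(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {us:.3f} us ({us / 1e6:.6f} s)")
```

By this formula the default of 10 is about 4.2 milliseconds per attempt, and 20 is about 4.3 seconds per attempt.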
>
> --------------------------------------------------------------------------
>
>
> DIFFERENT RUN:
>
> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
> ...
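
Since the help text suggests noting the hosts on which the error occurs, one way to do that is to tally the host names across the error lines. The following is a rough sketch, assuming the message layout shown above ("from HOST to: HOST error polling ..."); the names tally_hosts and SAMPLE are made up for illustration, and the regular expression may need adjusting for other Open MPI versions.

```python
import re
from collections import Counter

# Matches the "from HOST to: HOST error polling" part of each message.
# \s+ also spans line breaks introduced by mail wrapping.
PAIR_RE = re.compile(r"from\s+(\S+)\s+to:\s+(\S+)\s+error\s+polling")

# A few of the messages quoted in this post, including the quote markers.
SAMPLE = """\
> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress]
> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
"""


def tally_hosts(log_text: str) -> Counter:
    """Count how often each host appears as sender or receiver in an error."""
    # Strip e-mail quote markers, then join wrapped lines into one string.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines()]
    flat = " ".join(lines)
    counts = Counter()
    for src, dst in PAIR_RE.findall(flat):
        counts[src] += 1
        counts[dst] += 1
    return counts


if __name__ == "__main__":
    # Hosts that show up most often across runs are the first ones to check.
    for host, n in tally_hosts(SAMPLE).most_common():
        print(f"{host}: {n}")
```

Fed the combined output of several failing runs, a host or pair of hosts that appears far more often than the rest would be a reasonable place to start looking at cables and ports.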
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf