[Beowulf] mpi slow pairs
Lawrence Stewart
stewart at serissa.com
Sun Aug 31 12:45:22 PDT 2014
I believe in this context, the parking lot problem refers to the problem of
cars leaving a parking lot via one exit, with a tree of merge points before the exit.
If each merge is "fair" then a particular flow sees a bandwidth of 1/(2**n) where
n is the number of merge points to the exit.
(Try getting off the roof deck of the Boston Science Museum parking garage at closing
time!)
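A small sketch of that arithmetic, assuming every merge is a fair two-way merge and every source always has traffic to send. The function names and the chain layout are illustrative assumptions, not taken from Dally and Towles.

```python
from fractions import Fraction


def share_after_merges(n):
    """Share of the exit bandwidth left to a flow that passes n fair 2-way merges."""
    return Fraction(1, 2 ** n)


def chain_shares(num_sources):
    """Shares for saturated sources on a chain of fair merges leading to one exit.

    Source 0 joins at the merge nearest the exit. The two farthest sources
    meet at the last merge, so they end up with equal shares.
    """
    if num_sources < 1:
        raise ValueError("need at least one source")
    shares = []
    remaining = Fraction(1)
    for _ in range(num_sources - 1):
        remaining /= 2            # a fair merge gives half to the joining source
        shares.append(remaining)  # the other half continues upstream
    shares.append(remaining)      # the farthest source takes what is left
    return shares


if __name__ == "__main__":
    for i, share in enumerate(chain_shares(5)):
        print(f"source {i} (merges from exit: {min(i + 1, 4)}): {share}")
```

The shares always add up to the full exit bandwidth; the sources farthest from the exit are the ones that starve.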
This is discussed in Dally and Towles,
Principles and Practices of Interconnection Networks, my copy of which is at the office...
This is an effect that only happens when there are a lot of flows, and there is
congestion in the network somewhere.
In IB, if I understand IB correctly, which is unlikely, congestion happens if you
have, for example, a fat-tree which is not fully non-blocking (1/2 or 1/4 non-blocking)
or if the sum of the flows <to a particular node> exceeds that node's input link. In
these circumstances, flows which have fewer hops will get more bandwidth than flows
which have more. In addition, flows which happen to use links congested by these slow
flows will also become slow, for example due to head-of-line blocking or full switch buffers.
None of these effects should be visible in a pairwise bandwidth test, which would only
have one flow at a time in the network. Instead, the pairwise test ought to reveal
slow pairs that cross, say, links with high error rates or bad switch ports.
Such testing might give confusing results if the IB network is set up for dynamic routing,
which might change flows to avoid slow links (not sure how this works in IB, but maybe
it could be turned off.)
Getting back to the original question, I'm not aware of such an MPI test, but if one isn't
lying around in the Ohio State corpus or via Intel/Pathscale..., it shouldn't be hard to write one.
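As a rough sketch of what such a test could look like, the outline below measures every pair one at a time and flags the pairs that fall well below the median. Everything here is an assumption for illustration: the `measure` callable stands in for a real MPI ping-pong bandwidth measurement between two ranks, the 0.8 cutoff is arbitrary, and the node names and bandwidth figures are made up.

```python
from itertools import combinations
from statistics import median


def measure_all_pairs(nodes, measure):
    """Measure each unordered pair in turn, so only one flow is active at a time."""
    return {(a, b): measure(a, b) for a, b in combinations(nodes, 2)}


def slow_pairs(bandwidths, fraction=0.8):
    """Return (pair, bandwidth) for pairs below `fraction` of the median, slowest first."""
    cutoff = fraction * median(bandwidths.values())
    slow = [(pair, bw) for pair, bw in bandwidths.items() if bw < cutoff]
    return sorted(slow, key=lambda item: item[1])


def fake_measure(a, b):
    """Stand-in for a real measurement: one hypothetical pair is slow (MB/s)."""
    return 1200.0 if {a, b} == {"n3", "n5"} else 3100.0


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4", "n5"]
    results = measure_all_pairs(nodes, fake_measure)
    for (a, b), bw in slow_pairs(results):
        print(f"slow pair {a} <-> {b}: {bw:.0f} MB/s")
```

Comparing against the median rather than a fixed number means the check does not need to know the nominal link speed, at the cost of missing a fabric where most pairs are slow.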
I wasn't able to find a good Internet-accessible reference for the parking lot problem, but
it is mentioned in First Experiences with Congestion Control in InfiniBand Hardware (Gran
et al., 2010). The citation there is to Dally.
-L
On 2014, Aug 29, at 4:20 PM, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> On Fri, Aug 29, 2014 at 3:26 PM, Håkon Bugge <h-bugge at online.no> wrote:
>> Hmm, are all pairs going through the spine? If not, look up the parking-lot
>> problem. Håkon
>
> i believe all the pairs do pass through a spine. i'm not familiar
> with the "parking-lot problem", i'll google it, but suspect a
> bazillion hits will come back
>
>
>> Sent from my HTC
>>
>> ----- Reply message -----
>> From: "Michael Di Domenico" <mdidomenico4 at gmail.com>
>> To: "Beowulf Mailing List" <Beowulf at beowulf.org>
>> Subject: [Beowulf] mpi slow pairs
>> Date: Fri, Aug 29, 2014 18:09
>>
>>
>>
>> On Fri, Aug 29, 2014 at 11:38 AM, John Hearns <John.Hearns at viglen.co.uk>
>> wrote:
>>>> Also have you run ibdiagnet to see if anything is flagged up?
>>>
>>> i've run a multitude of ib diags on the machines, but nothing is popping
>>> out as wrong. what's weird is that it's only certain pairing of machines
>>> not any one machine in general.
>>>
>>> Would that then be a problem in one of the blades or a part of the switch?
>>
>> not sure yet, i think one of the spine modules in the switch is silently
>> failing to send traffic at full speed, but i've not been able to
>> "prove" this yet.
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf