Fw: [Beowulf] Correct networking solution for 16-core nodes
Vincent Diepeveen
diep at xs4all.nl
Fri Aug 4 08:36:08 PDT 2006
So, for the personnel replacing Greg,
I understand that your cards can't interrupt at all.
Users just have to wait until other messages have passed the wire
before receiving a very short message (one that, for example, aborts the
entire job).
In short, if some other person on the cluster is streaming data to some
nodes,
then you have a major latency problem.
Vincent
From: Greg Lindahl
To: Vincent Diepeveen
On Fri, Aug 04, 2006 at 11:19:42AM +0100, Vincent Diepeveen wrote:
> What is that time 'in between' for your specific card?
Zero. That's the whole point of the message-rate benchmark, and a
unique aspect of the interconnect that I designed.
Now please stop emailing me personally; you know I find you extremely
annoying.
> So the time needed to *interrupt* the current long message.
Our interconnect uses no interrupts.
> The cruel reality of trying to scale 100% on a network is that you can't
> make a special
> thread that just nonstop checks for an MPI message, like all you guys do for
> your pingpong measurements.
We do not ever use a special thread.
> That is the REAL problem.
The real problem is that you do not understand that you don't know
everything.
Now, as I said earlier, never email me personally.
-- greg
----- Original Message -----
From: "Vincent Diepeveen" <diep at xs4all.nl>
To: "Greg Lindahl" <greg.lindahl at qlogic.com>
Sent: Friday, August 04, 2006 11:19 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes
> Yeah, you meant it's 200 usec latency.
>
> When all 16 cores want something from the card and that card is serving 16
> threads,
> then 200 usec is probably the minimum latency before some other thread
> receives some very short message "in between" while, for example, one long
> MPI message (say of about 200 MB) is arriving.
>
> What is that time 'in between' for your specific card?
>
> So the time needed to *interrupt* the current long message.
>
> The cruel reality of trying to scale 100% on a network is that you can't
> make a special
> thread that just nonstop checks for an MPI message, like all you guys do for
> your pingpong measurements.
>
> If you have 16 cores, then you want to run 16 processes on those 16 cores; a
> 17th thread is already doing time division, and an 18th thread is doing
> I/O from and to the user. If a 19th thread checks regularly for MPI messages
> from other nodes
> and is not in the runqueue, the OS already has a wakeup latency of 10
> milliseconds to put that thread back in the runqueue.
>
> That is the REAL problem.
>
> Because of that runqueue latency, you just can't dedicate a special thread to
> short messages if you want to use all cores.
>
> So the only solution for that is polling from the working processes
> themselves.
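>
> To sketch what I mean (just an illustration; do_some_work() and ABORT_TAG
> are hypothetical names, the poll itself is a standard MPI_Iprobe), each
> working process interleaves its own computation with a cheap check for a
> short control message:
>
>     #include <mpi.h>
>
>     #define ABORT_TAG 99   /* hypothetical tag reserved for short control messages */
>
>     static void do_some_work(void) { /* one slice of the real computation */ }
>
>     static void compute_loop(void)
>     {
>         int flag;
>         MPI_Status status;
>
>         for (;;) {
>             do_some_work();
>
>             /* cheap non-blocking poll: is a short message waiting? */
>             MPI_Iprobe(MPI_ANY_SOURCE, ABORT_TAG, MPI_COMM_WORLD, &flag, &status);
>             if (flag) {
>                 int cmd;
>                 MPI_Recv(&cmd, 1, MPI_INT, status.MPI_SOURCE, ABORT_TAG,
>                          MPI_COMM_WORLD, &status);
>                 break;   /* e.g. abort the entire job */
>             }
>         }
>     }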
>
> For non-embarrassingly parallel software that needs to poll for short
> messages, the time needed
> to do a single poll to see whether a tiny message is there is therefore CRUCIAL.
>
> If it's a read from local RAM (local to that processor) taking 0.13 us,
> then that is in fact already slowing down the
> program a tad.
>
> Preferably most such polls happen from the L2, which is a cycle or 13.
>
> It's quite interesting to know which card/implementation has the fastest
> poll time here for processes that regularly poll for short messages,
> including the overhead to check for overflow in the given protocol.
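>
> A rough way to measure that per-poll cost on a given card/MPI stack (just a
> sketch, nothing vendor-specific) is to time a large number of polls that
> find no message:
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         const int N = 1000000;
>         int i, flag;
>         double t0, t1;
>         MPI_Status status;
>
>         MPI_Init(&argc, &argv);
>
>         t0 = MPI_Wtime();
>         for (i = 0; i < N; i++)   /* N polls that find nothing */
>             MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
>                        &flag, &status);
>         t1 = MPI_Wtime();
>
>         printf("average empty poll: %.3f usec\n", (t1 - t0) * 1e6 / N);
>
>         MPI_Finalize();
>         return 0;
>     }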
>
> If that's 0.5 us because you have to check for all kinds of MPI overflow,
> then that sucks a lot. Such a card I'd throw away directly.
>
> Vincent
>
> ----- Original Message -----
> From: "Greg Lindahl" <greg.lindahl at qlogic.com>
> To: "Joachim Worringen" <joachim at dolphinics.com>; <beowulf at beowulf.org>
> Sent: Thursday, August 03, 2006 10:07 PM
> Subject: Re: [Beowulf] Correct networking solution for 16-core nodes
>
>
>> On Thu, Aug 03, 2006 at 12:53:40PM -0700, Greg Lindahl wrote:
>>
>>> We have clearly stated that the Mellanox switch is around 200 usec per
>>> hop. Myricom's number is also well known.
>>
>> Er, 200 nanoseconds. Y'all know what I meant, right? :-)
>>
>> -- greg
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>