Fw: [Beowulf] Correct networking solution for 16-core nodes

Vincent Diepeveen diep at xs4all.nl
Fri Aug 4 08:36:08 PDT 2006


So, for Greg's replacement personnel:
I understand that your cards can't interrupt at all.
Users just have to wait until other messages have passed over the wire
before receiving a very short message (one that, for example, aborts the
entire job).

In short, if some other person on the cluster is streaming data to some
nodes, then you have a major latency problem.
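
To make concrete what I mean by "polling from the working processes
themselves" (see my quoted message further down), here is a rough sketch in
plain MPI C. The control tag, the poll interval and the fake work loop are
just illustrative choices, not a recommendation:

/*
 * Rough sketch: each worker rank interleaves its compute loop with a cheap
 * MPI_Iprobe for a short control message (e.g. an abort), instead of
 * dedicating an extra thread to MPI. Some other rank would MPI_Send the
 * command; that sender is not shown here.
 */
#include <mpi.h>
#include <stdio.h>

#define CTRL_TAG 999   /* illustrative tag reserved for short control messages */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = 0.0;
    int stop = 0;

    for (long iter = 0; iter < 100000000L && !stop; iter++) {
        x += 1e-9 * iter;               /* stand-in for real work */

        if ((iter & 0xFFFF) == 0) {     /* poll only every ~65k iterations */
            int flag = 0;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, CTRL_TAG, MPI_COMM_WORLD, &flag, &st);
            if (flag) {
                int cmd;
                MPI_Recv(&cmd, 1, MPI_INT, st.MPI_SOURCE, CTRL_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (cmd == 1)           /* 1 = "abort the job", by convention here */
                    stop = 1;
            }
        }
    }

    printf("rank %d done, x = %f, stopped early: %d\n", rank, x, stop);
    MPI_Finalize();
    return 0;
}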

Vincent

From: Greg Lindahl
To: Vincent Diepeveen
On Fri, Aug 04, 2006 at 11:19:42AM +0100, Vincent Diepeveen wrote:

> What is that time 'in between' for your specific card?

Zero. That's the whole point of the message-rate benchmark, and a
unique aspect of the interconnect that I designed.

Now please stop emailing me personally, you know I find you extremely
annoying.

> So the time needed to *interrupt* the current long message.

Our interconnect uses no interrupts.

> The cruel reality of trying to scale to 100% on a network is that you can't
> make a special thread that just non-stop checks for an MPI message, like all
> you guys do for your pingpong measurements.

We do not ever use a special thread.

> That is the REAL problem.

The real problem is that you do not understand that you don't know
everything.

Now, as I said earlier, never email me personally.

-- greg

----- Original Message ----- 
From: "Vincent Diepeveen" <diep at xs4all.nl>
To: "Greg Lindahl" <greg.lindahl at qlogic.com>
Sent: Friday, August 04, 2006 11:19 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes


> Yeah, you meant it's 200 usec latency.
>
> When all 16 cores want something from the card and that card is serving 16
> threads, then 200 usec is probably the minimum latency from the moment one
> long MPI message (say, about 200 MB) starts arriving until some other thread
> receives a very short message "in between".
>
> What is that time 'in between' for your specific card?
>
> So the time needed to *interrupt* the current long message.
>
> The cruel reality of trying to scale to 100% on a network is that you can't
> make a special thread that just non-stop checks for an MPI message, like all
> you guys do for your pingpong measurements.
>
> If you have 16 cores, then you want to run 16 processes on those 16 cores; a
> 17th thread already means time slicing, and an 18th thread is doing I/O from
> and to the user. If a 19th thread checks regularly for MPI messages from
> other nodes, then whenever it's not in the runqueue the OS already has a
> wakeup latency of about 10 milliseconds to put it back in the runqueue.
>
> That is the REAL problem.
>
> You just can't dedicate a special thread to short messages if you want to
> use all cores, because of that runqueue latency.
>
> So the only solution for that is polling from the working processes 
> themselves.
>
> For non-embarrassingly-parallel software that needs to poll for short
> messages, the time needed for a single poll to see whether a tiny message
> has arrived is therefore CRUCIAL.
>
> If it's a read from local RAM (local to that processor) taking 0.13 us,
> then that is in fact already slowing the program down a tad.
>
> Preferably most of those polls are served from the L2 cache, which takes 13
> cycles or so.
>
> It would be quite interesting to know which card/implementation has the
> fastest poll time here for processes that regularly poll for short messages,
> including the overhead of checking for overflow in the given protocol (a
> rough sketch of such a measurement follows below the quoted thread).
>
> If that's 0.5 us because you have to check for all kinds of MPI overflow,
> then that sucks a lot. Such a card I would throw away immediately.
>
> Vincent
>
> ----- Original Message ----- 
> From: "Greg Lindahl" <greg.lindahl at qlogic.com>
> To: "Joachim Worringen" <joachim at dolphinics.com>; <beowulf at beowulf.org>
> Sent: Thursday, August 03, 2006 10:07 PM
> Subject: Re: [Beowulf] Correct networking solution for 16-core nodes
>
>
>> On Thu, Aug 03, 2006 at 12:53:40PM -0700, Greg Lindahl wrote:
>>
>>> We have clearly stated that the Mellanox switch is around 200 usec per
>>> hop.  Myricom's number is also well known.
>>
>> Er, 200 nanoseconds. Y'all know what I meant, right? :-)
>>
>> -- greg
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
> 
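
And a rough sketch of how one could measure that per-poll cost: time a large
batch of MPI_Iprobe calls on a tag with nothing pending and take the average.
The result will of course depend on the MPI library, the NIC and whether the
probe hits cache, so treat it as a starting point only:

/*
 * Rough timing sketch for the cost of a single poll: repeatedly call
 * MPI_Iprobe on a tag that never receives a message and report the
 * average time per call in microseconds.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const long iters = 1000000L;
    int flag;
    MPI_Status st;

    /* warm up so first-call setup cost is not included in the measurement */
    for (int i = 0; i < 1000; i++)
        MPI_Iprobe(MPI_ANY_SOURCE, 42, MPI_COMM_WORLD, &flag, &st);

    double t0 = MPI_Wtime();
    for (long i = 0; i < iters; i++)
        MPI_Iprobe(MPI_ANY_SOURCE, 42, MPI_COMM_WORLD, &flag, &st);
    double t1 = MPI_Wtime();

    printf("average MPI_Iprobe cost: %.3f usec\n",
           (t1 - t0) / (double)iters * 1e6);

    MPI_Finalize();
    return 0;
}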



