[Beowulf] Correct networking solution for 16-core nodes

Vincent Diepeveen diep at xs4all.nl
Fri Aug 4 08:52:08 PDT 2006


Resend of an email that by accident went to Greg instead of to the mailing list:

Yeah you meant it's 200 usec latency

When all 16 cores want something from the card and that card is serving 16
threads, then 200 usec is probably the minimum latency from the moment a long
MPI message (say about 200 MB) starts arriving until some other thread
receives a very short message "in between".

What is that time 'in between' for your specific card?

That is, the time needed to *interrupt* the current long message.
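
To make the question concrete, here is a rough sketch of the kind of test I
mean (purely illustrative, my own made-up tags and sizes, not anyone's
benchmark): rank 1 ships a 200 MB message plus a tiny one on a separate tag,
rank 0 receives the big one nonblocking and keeps probing for the tiny one,
and the interesting number is how long the tiny one needs to get through.

/* Rough sketch, not a real benchmark: rank 1 sends a 200 MB message plus a
 * tiny one on another tag; rank 0 receives the big one nonblocking and keeps
 * probing for the tiny one.  The time until the tiny message gets through
 * "in between" is the number that matters here. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BIG_BYTES (200 * 1024 * 1024)   /* the long message   */
#define TAG_BIG   1                     /* made-up tags       */
#define TAG_SMALL 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2)
        MPI_Abort(MPI_COMM_WORLD, 1);

    char *buf = malloc(BIG_BYTES);
    int tiny = 42;
    MPI_Request big;

    if (rank == 0) {
        int done = 0, flag = 0;
        double t0, t_small = -1.0;
        MPI_Status st;

        MPI_Irecv(buf, BIG_BYTES, MPI_CHAR, 1, TAG_BIG, MPI_COMM_WORLD, &big);
        t0 = MPI_Wtime();
        while (!done) {
            MPI_Test(&big, &done, MPI_STATUS_IGNORE);
            MPI_Iprobe(1, TAG_SMALL, MPI_COMM_WORLD, &flag, &st);
            if (flag && t_small < 0.0) {
                MPI_Recv(&tiny, 1, MPI_INT, 1, TAG_SMALL,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                t_small = MPI_Wtime() - t0;   /* the time "in between" */
            }
        }
        if (t_small < 0.0) {   /* tiny one only got through after the big one */
            MPI_Recv(&tiny, 1, MPI_INT, 1, TAG_SMALL,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            t_small = MPI_Wtime() - t0;
        }
        printf("tiny message arrived after %.1f usec\n", t_small * 1e6);
    } else if (rank == 1) {
        MPI_Isend(buf, BIG_BYTES, MPI_CHAR, 0, TAG_BIG, MPI_COMM_WORLD, &big);
        MPI_Send(&tiny, 1, MPI_INT, 0, TAG_SMALL, MPI_COMM_WORLD);
        MPI_Wait(&big, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}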

The cruel reality of trying to scale to 100% on a network is that you can't
dedicate a special thread to nonstop checking for an MPI message, like all of
you do for your ping-pong measurements.

If you have 16 cores, you want to run 16 processes on those 16 cores. A 17th
thread is already time-slicing, an 18th thread is doing I/O to and from the
user, and a 19th thread would check regularly for MPI messages from other
nodes. If that thread is not in the runqueue, the OS already has a wakeup
latency of 10 milliseconds to put it back in the runqueue.

That is the REAL problem.

Thanks to that runqueue latency, you just can't dedicate a special thread to
short messages if you want to use all cores.

So the only solution for that is polling from the working processes
themselves.

For non-embarrassingly-parallel software that has to poll for short messages,
the time needed for a single poll to see whether a tiny message is there is
therefore CRUCIAL.
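
Concretely, each worker ends up doing something like this (just a pattern
sketch; do_some_work() and handle_short_message() are placeholders I made up,
not real code):

/* Sketch of polling from the working process itself: a chunk of real work,
 * then one cheap MPI_Iprobe to see whether a short message is waiting.
 * do_some_work() and handle_short_message() are placeholders for whatever
 * the program actually does. */
#include <mpi.h>

#define TAG_SHORT 7   /* made-up tag for the short control messages */

static int  do_some_work(void)             { static int n = 1000000; return --n > 0; }
static void handle_short_message(int *msg) { (void)msg; /* act on it */ }

void work_loop(void)
{
    int busy = 1;
    while (busy) {
        int i, flag = 0;
        MPI_Status st;

        /* a chunk of real computation between polls */
        for (i = 0; i < 1000 && busy; i++)
            busy = do_some_work();

        /* the single poll whose cost ends up in every worker's inner loop */
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_SHORT, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, st.MPI_SOURCE, TAG_SHORT,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            handle_short_message(&msg);
        }
    }
}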

If it's a read from local RAM (local to that processor) taking 0.13 us, then
that is in fact already slowing the program down a tad.

Preferably most of such polls are served from L2, which is a cycle or 13.

It would be quite interesting to know which card/implementation has the
fastest poll time here for processes that regularly poll for short messages,
including the overhead of checking for overflow in the given protocol.

If that's 0.5 us because you have to check for all kinds of MPI overflow, then
that sucks a lot. Such a card I would throw away immediately.
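
Measuring that per-poll cost is easy enough; a quick sketch along these lines
(again just illustrative, nothing official) already tells you whether a
card/MPI sits nearer to 0.13 us or to 0.5 us:

/* Quick sketch to time a single empty poll: call MPI_Iprobe a million times
 * with nothing pending and divide.  That per-poll cost is what lands in the
 * inner loop of every worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000000;
    int i, flag;
    double t0, per_poll;
    MPI_Status st;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
    per_poll = (MPI_Wtime() - t0) / iters;

    printf("one MPI_Iprobe with nothing pending: %.3f usec\n", per_poll * 1e6);

    MPI_Finalize();
    return 0;
}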

Vincent

----- Original Message ----- 
From: "Greg Lindahl" <greg.lindahl at qlogic.com>
To: "Joachim Worringen" <joachim at dolphinics.com>; <beowulf at beowulf.org>
Sent: Thursday, August 03, 2006 10:07 PM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes


> On Thu, Aug 03, 2006 at 12:53:40PM -0700, Greg Lindahl wrote:
>
>> We have clearly stated that the Mellanox switch is around 200 usec per
>> hop.  Myricom's number is also well known.
>
> Er, 200 micro seconds. Y'all know what I meant, right? :-)
>
> -- greg
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>



