[Beowulf] Questions regarding interconnects

Fri Mar 25 14:39:15 PST 2005

At 04:02 PM 3/25/2005 -0500, Patrick Geoffray wrote:
>Hi Vincent,
>
>Vincent Diepeveen wrote:
>> I feel it is very important to look at 'shmem' capabilities. 
>
>> In order for B to receive, it has to have either a special thread 
>> that regularly polls. If you have a thread that polls say every 10
>> milliseconds, then what's the use of using a high-end network 
>> card (other than its DMA capabilities)?
>
>You are in a situation where you don't have to wait for the message to 
>arrive, you can move on and check 10 ms later. In this case, you don't 
>care about network speed.
>
>> However, it's very expensive to poll.

>No, it's not. Not in the OS-bypass world.

What you define as cheap is what I define as expensive.

>> On the other hand using the 'shmem', what happens is that A ships a
>> nonblocking write to B of just a few bytes. The network card in B simply
>> writes it in the RAM.
>>
>> Now and then the searching process at B only has to poll its own main
>> memory to see whether it has a '1'. So sometimes you lose a TLB-thrashing
>> access to it, but other times it comes from L2 cache.

>It's still polling. With message passing, you actually poll a queue in 
>the MPI lib instead of a specific location in the user application. That 
>helps when you are looking for several messages from several sources 
>(got to poll several locations in your model).

The only ways to avoid polling are to accept deadly run-queue latencies
or to waste a full CPU.

Message passing with MPI has the queue overrun problem. You must make
several calls, which makes it altogether dead slow.

I do not know about your software, but for mine losing time to TLB thrashing
will *considerably* slow it down. So I try to save every slow memory access
I can.

Yet it is obvious that you cannot avoid regularly wasting 1 memory access
when polling.

I'm prepared to lose *that* poll time to a single memory location.
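
To make that cost concrete, here is a minimal C sketch of a search loop that
reads one flag in its own memory every so many nodes. Everything in it is an
assumption for illustration: the names new_work_flag, simulate_remote_write
and search_with_polling are hypothetical, and the "remote write" is only a
local store standing in for the network card, not a real interconnect call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word in local RAM that a remote PUT would set to 1. */
static _Atomic uint32_t new_work_flag = 0;

/* Stand-in for the network card writing into local memory (assumption:
   this is just a local store, no real interconnect is involved). */
void simulate_remote_write(void) {
    atomic_store_explicit(&new_work_flag, 1, memory_order_release);
}

/* Visit up to total_nodes "nodes", reading the flag once every
   poll_interval nodes. Returns the number of polls performed and stores
   the number of nodes visited before stopping in *nodes_done. */
uint64_t search_with_polling(uint64_t total_nodes, uint64_t poll_interval,
                             uint64_t *nodes_done) {
    assert(poll_interval > 0);
    assert(nodes_done != NULL);
    uint64_t polls = 0;
    uint64_t n = 0;
    while (n < total_nodes) {
        n++;                      /* placeholder for real search work */
        if (n % poll_interval == 0) {
            polls++;
            if (atomic_load_explicit(&new_work_flag, memory_order_acquire))
                break;            /* new data arrived: stop and react */
        }
    }
    *nodes_done = n;
    return polls;
}
```

With a poll interval of 100, a run over 1000 nodes costs 10 flag reads when
nothing arrives, and the loop stops at the first poll after the flag is set.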

>> So for short messages which are latency sensitive that 'shmem' of quadrics
>> is just far superior.

>You are getting confused with words. "SHMEM" is a legacy shared memory 
>interface that was used on Cray machines like the T3D. It's not a 
>standard per se, it's a software interface. The implementations usually 
>rest on top of remote memory operations (PUT/GET).

You are of course correct here; my historical knowledge of Crays is
very limited.

I have only run on a few supercomputers recently, and the Cray T3D was not
one of them.

>It always strikes me when people put "one-sided" and "latency sensitive" 
>in the same sentence. "one-sided" means that you don't want to involve 
>the remote side in the communication and "latency sensitive" means the 
>other side is waiting for the communication.
>In your example, you will be looking if someone has written in your 
>memory every X ms. In this case, what do you care about latency ?

That is exactly my problem with MPI. 

The majority of researchers using MPI will check the MPI queue so often,
in order not to be delayed, that the program as a whole slows down. 

If they avoid that, the only advantage of using MPI is the huge bandwidth
the card delivers.

When cheap 10 Gbit cards arrive, of course *that* advantage is gone too,
and MPI has really little benefit.

Calling it portable is not a good argument IMHO. It is so hard to get
system time on big supercomputers/superclusters that you can work full-time
to get something running on them anyway; the paperwork needed just to get
the system time is already half a man-year. 

>> Do other cards implement something similar?

>You can do PUT on most high speed networks, this is a pretty basic 
>functionality. The SHMEM interface may not be used because it makes 
>sense only for former Cray customers, but look for portable RMA 
>implementations like ARMCI for example.

I'm pretty sure Quadrics still offers the shmem functionality to its users. 

See Prof. Aad v/d Steen's "Supercomputers in Europe" report for the Dutch
government (NWO, NCF).

>> As far as i know they do not.
>
>Do more research.

>> The overhead of the MPI implementation layer *receiving* bytes is just so
>> huge. A card's theoretical one-way ping-pong latency is irrelevant compared
>> to that, because those one-way ping-pong programs on all cards eat 100%
>> system time, effectively losing a full CPU.
>
>You are mistaken about the MPI receive overhead. You are also mistaken 

I am clearly not mistaken; otherwise you would show the actual number of
memory accesses needed for the MPI overhead.

What I do know is that in MPI you must always first avoid buffer overrun.
That means for example:

  MPI_Isend(....);
  MPI_Test(&Reg, &flg, &Stat);
  while (!flg) {
    // Important: read in pending messages and process them while waiting
    // for the send to complete. Otherwise our own input buffer can
    // overflow and we get a deadlock.
    Myprogram_MsgPending();
    MPI_Test(&Reg, &flg, &Stat);
  }

So that's 4 function calls where there should be 1, just for sending/polling.

That means I lose 3 unnecessary calls where I only want to spend 1.

All I want to do is modify the job of a remotely running processor as
fast as possible. So I want to write at most, say, 4-64 bytes into one of
its arrays.

As soon as the remote processor polls during its search, it can use the
new parameters and continue.

In search this gives an exponential speedup when using
  YBW (young brothers wait) + alpha-beta + null move 
  + shared transposition tables (hash tables)

So it is crucial to inform other processors as soon as possible.
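
As a rough illustration of why an early bound matters in alpha-beta, here is
a minimal C sketch on a toy two-level tree. It is not YBW and not any real
engine: the leaf values, the names min_node and search_root, and the idea of
starting the root with a bound already received from another processor are
all assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BRANCH 3

/* Hypothetical toy tree: a max root, three min nodes, three leaves each. */
static const int leaves[BRANCH][BRANCH] = {
    {2, 9, 8},
    {3, 5, 6},
    {1, 4, 7},
};

static int leaves_visited;

/* Min node: take the minimum over its leaves, but stop as soon as the
   running minimum can no longer beat the lower bound alpha. */
static int min_node(int child, int alpha) {
    int best = INT_MAX;
    for (int i = 0; i < BRANCH; i++) {
        leaves_visited++;
        if (leaves[child][i] < best)
            best = leaves[child][i];
        if (best <= alpha)
            break;                /* cutoff: this move cannot improve alpha */
    }
    return best;
}

/* Max root searched with a starting lower bound alpha. Stores the number
   of leaves evaluated in *visited and returns the best value found
   (or the starting alpha if no child beats it). */
int search_root(int alpha, int *visited) {
    assert(visited != NULL);
    leaves_visited = 0;
    for (int c = 0; c < BRANCH; c++) {
        int v = min_node(c, alpha);
        if (v > alpha)
            alpha = v;
    }
    *visited = leaves_visited;
    return alpha;
}
```

On this toy tree the root value is 3 either way. Starting with no bound
(INT_MIN) evaluates 7 leaves, while starting with a bound of 2 evaluates 5,
because the first move is cut off after a single leaf.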

>in your belief that one-sided operations are silver bullets. RMA 
>operations may be more appropriate to an application design, but they 
>share many constraints with message passing: you have to poll to know 
>when it's done, you have to tell the other side where to write 
>(equivalent to posting a recv). They have drawbacks like usually not 
>scaling in space (each sender should write to a different location).

The silver bullet is that an array in the remote processor receives the
data, without either this processor or the remote one needing to make 6
function calls first.

The local processor shipping the data just wants 1 nonblocking function
call; the remote processor just wants to check a single variable now and
then to see whether there is data.
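
The pattern described above can be sketched in plain C as a small mailbox:
one call deposits the payload and bumps a counter, and the receiver reads
only that counter until it changes. This is a single-process stand-in under
stated assumptions, not the SHMEM interface or any vendor API: mailbox_t,
mailbox_init, mailbox_put and mailbox_poll are hypothetical names, and on a
real interconnect the ordering between the payload and the counter would
have to be guaranteed by the network layer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MAILBOX_BYTES 64   /* upper end of the 4-64 byte messages discussed */

typedef struct {
    unsigned char payload[MAILBOX_BYTES];
    size_t len;
    atomic_uint seq;       /* bumped only after the payload is complete */
} mailbox_t;

void mailbox_init(mailbox_t *mb) {
    memset(mb->payload, 0, sizeof mb->payload);
    mb->len = 0;
    atomic_init(&mb->seq, 0);
}

/* Stand-in for the sender's single nonblocking PUT (assumption: a local
   memcpy plays the role of the network card writing into remote RAM). */
void mailbox_put(mailbox_t *mb, const void *data, size_t len) {
    assert(len <= MAILBOX_BYTES);
    memcpy(mb->payload, data, len);
    mb->len = len;
    atomic_fetch_add_explicit(&mb->seq, 1, memory_order_release);
}

/* Receiver side: read one variable; copy the payload only if it changed.
   Returns the number of bytes copied into out, or 0 if nothing is new. */
size_t mailbox_poll(mailbox_t *mb, unsigned *last_seen, void *out) {
    unsigned s = atomic_load_explicit(&mb->seq, memory_order_acquire);
    if (s == *last_seen)
        return 0;          /* common case: a single memory read */
    memcpy(out, mb->payload, mb->len);
    *last_seen = s;
    return mb->len;
}
```

The receiver keeps its own last_seen counter (starting at 0) and calls
mailbox_poll from inside its search loop; the sender never waits on it.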

Message passing seems to me too slow for that. Just the function calls
already look like a big barrier and don't make programming simpler:

  MPI_Isend(....);
  MPI_Test(&Reg, &flg, &Stat);
  while (!flg) {
    // Important: read in pending messages and process them while waiting
    // for the send to complete. Otherwise our own input buffer can
    // overflow and we get a deadlock.
    Myprogram_MsgPending();
    MPI_Test(&Reg, &flg, &Stat);
  }

Vincent

>Patrick
>-- 
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>
>