[Beowulf] Questions regarding interconnects

Fri Mar 25 06:04:53 PST 2005

One thing I feel is very important to look at is 'shmem' capabilities.

That avoids so many problems.

To give a simple example, suppose I want to update a node that is busy searching:

In MPI you ship a nonblocking message from node A to B.

In order for B to receive it, B either has to have a special thread
that regularly polls, or it has to poll from inside the searching thread
itself. If you have a thread that polls, say, every 10 milliseconds, then
what's the use of a high-end network card (other than its DMA capabilities)?
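
To put a number on the polling-thread case, here is a rough sketch of the
arithmetic. The 10 ms interval and the 2.2 GHz clock are the figures used in
this post; the uniform-arrival assumption behind the mean wait is mine, and the
function name is only illustrative.

```python
# Rough arithmetic only: how long can a message sit unnoticed if a
# dedicated thread polls every 10 ms on a 2.2 GHz processor?
CLOCK_HZ = 2.2e9          # 2.2 GHz processor
POLL_INTERVAL_S = 10e-3   # polling thread wakes up every 10 milliseconds


def cycles(seconds, clock_hz=CLOCK_HZ):
    """Number of CPU cycles that elapse in `seconds` at `clock_hz`."""
    return seconds * clock_hz


# Worst case: the message lands just after a poll and waits a full interval.
worst_wait_cycles = cycles(POLL_INTERVAL_S)
# Mean case, ASSUMING message arrival is uniform over the interval.
mean_wait_cycles = cycles(POLL_INTERVAL_S / 2)

print(f"worst-case wait: {worst_wait_cycles:,.0f} cycles")
print(f"mean wait:       {mean_wait_cycles:,.0f} cycles")
```

That is on the order of tens of millions of cycles of added delay, whatever
the card itself can do.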

If you instead *poll* for MPI messages now and then from within the
searching thread (the one that eats all the CPU time), then that's the
best solution.

However, it's very expensive to poll. 

Perhaps someone can calculate for me exactly how many CPU cycles I would
lose on, say, a 2.2 GHz processor.

With 'shmem', on the other hand, A ships a nonblocking write of just a
few bytes to B. The network card in B simply writes it into RAM.

The searching process at B only has to poll its own main memory now and
then to see whether it holds a '1'. Sometimes that costs you a
TLB-thrashing memory access, but other times the value comes from the
L2 cache.
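
The flag-polling pattern can be sketched as below. This is only a simulation
under stated assumptions: a Python thread stands in for the network card that
deposits the '1' into B's memory, and `POLL_EVERY`, `MAX_NODES`, `remote_write`,
`search` and `run_demo` are hypothetical names, not part of any shmem library.

```python
import threading
import time

POLL_EVERY = 4096         # hypothetical: glance at the flag once per 4096 nodes
MAX_NODES = 200_000_000   # safety cap so the sketch always terminates


def remote_write(mailbox, delay_s):
    """Stand-in for the NIC: after a delay, deposit a '1' in B's memory."""
    time.sleep(delay_s)
    mailbox[0] = 1


def search(mailbox):
    """Dummy search loop that only reads its own local memory now and then."""
    nodes = 0
    while nodes < MAX_NODES:
        nodes += 1                                  # placeholder for real search work
        if nodes % POLL_EVERY == 0 and mailbox[0] == 1:
            return nodes, True                      # flag seen; react to the message here
    return nodes, False


def run_demo(delay_s=0.05):
    mailbox = bytearray(1)                          # the few bytes A may write into
    writer = threading.Thread(target=remote_write, args=(mailbox, delay_s))
    writer.start()
    nodes, seen = search(mailbox)
    writer.join()
    return nodes, seen


if __name__ == "__main__":
    nodes, seen = run_demo()
    print(f"flag seen: {seen} after {nodes} search nodes")
```

The point of the design is that the receiver never calls into a messaging
layer; the only cost on its side is one memory read every `POLL_EVERY` nodes.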

A TLB-thrashing access is just 400 ns even on old chipsets, and around
133 ns on a dual Opteron.

An L2 cache lookup is 20 cycles on the K7 and 13 cycles on the Opteron.
The latter is roughly 5-6 nanoseconds.
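
For comparison, the figures above can be put on one scale. A small sketch,
assuming the 2.2 GHz clock mentioned earlier in this post applies to the
Opteron numbers (the K7 figure is left out because no clock is given for it):

```python
CLOCK_HZ = 2.2e9  # assumed clock, taken from the 2.2 GHz figure in this post


def ns_to_cycles(ns, clock_hz=CLOCK_HZ):
    """Convert a latency in nanoseconds to CPU cycles."""
    return ns * 1e-9 * clock_hz


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds."""
    return cycles / clock_hz * 1e9


tlb_old_cycles = ns_to_cycles(400)       # TLB-thrashing access, old chipset
tlb_opteron_cycles = ns_to_cycles(133)   # TLB-thrashing access, dual Opteron
l2_opteron_ns = cycles_to_ns(13)         # L2 lookup, Opteron

print(f"400 ns    = {tlb_old_cycles:.0f} cycles")
print(f"133 ns    = {tlb_opteron_cycles:.0f} cycles")
print(f"13 cycles = {l2_opteron_ns:.1f} ns")
```

At that clock, even the worst case is under a thousand cycles per poll.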

So for short, latency-sensitive messages, the 'shmem' of Quadrics is
just far superior.

Do other cards implement something similar?

As far as I know they do not.

The overhead of the MPI implementation layer *receiving* bytes is just so
huge. A card's theoretical one-way pingpong latency is irrelevant by
comparison, because on every card the one-way pingpong program eats 100%
of the CPU time, effectively costing you a full CPU.

If you lose a full CPU, the efficiency of your software degrades incredibly.

In fact, on nodes with 1 CPU you have hardly any CPU time left.

So what is really important is measuring the effective wall-clock time you
lose before you receive the message, without hurting the main processor
too much.

In addition, you need to measure how much main processing time you
lose to the network.
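
A minimal sketch of taking both measurements at once, wall-clock time and CPU
time burned while waiting. It runs on a single machine with no network
involved; `wait_busy`, `wait_sleeping` and `measure` are hypothetical helpers
that only contrast a spinning receiver with an idle one.

```python
import time


def wait_busy(duration_s):
    """Spin until the deadline, as a busy-polling receiver would."""
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        pass


def wait_sleeping(duration_s):
    """Give the CPU back while waiting."""
    time.sleep(duration_s)


def measure(wait_fn, duration_s=0.2):
    """Return (wall_seconds, cpu_seconds) spent inside wait_fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    wait_fn(duration_s)
    return time.perf_counter() - wall0, time.process_time() - cpu0


if __name__ == "__main__":
    for name, fn in (("busy poll", wait_busy), ("sleep", wait_sleeping)):
        wall, cpu = measure(fn)
        print(f"{name:9s}: wall {wall:.3f} s, cpu {cpu:.3f} s")
```

Both waits take about the same wall-clock time, but only the busy poll takes
CPU time away from the real work; a latency benchmark that reports only the
first number hides the second.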

In the end, what matters is how quickly the main processor gets the job
for its process, and how much CPU time this main process can use from
the main processor(s).

The network is just a tool to deliver data from A to B and should not get
in the way of the real job to be done.

Vincent

At 10:41 AM 3/22/2005 -0600, Richard Walsh wrote:
>Olli-Pekka Lehto wrote:
>
>> Hello,
>>
>> I'm writing a paper on current and emerging cluster interconnect 
>> technologies as a part of my University studies. I have included 1GbE 
>> (incl. RDMA), 10GbE, Quadrics, InfiniBand and Myrinet. The goal is to 
>> provide an introduction to the subject maybe more from a network 
>> engineer's point of view with an overview on the key features and the 
>> pros/cons of each solution. I have some questions which I hope you 
>> could help me out with:
>
>    I think that integrating a custom interconnect for comparison into
>    your analysis would be useful to contrast the capabilities of
>    "commodity cluster" interconnects with those of the presumptive
>    custom leading edge. I would choose the Cray X1e or Altix
>    interconnects for this.
>
>>
>> What do you see as the key differentiating factors in the quality of 
>> an MPI implementation? So far I have come up with the following:
>> -Completeness of the implementation
>> -Latency/bandwidth
>> -Asynchronous communication
>> -Smart collective communication
>
>    I think that explicit treatment/comparison of the interconnect's
>    RDMA capabilities is important as they support both MPI-2 and the
>    new-ish UPC and CAF compilers for cluster systems.  I can send you
>    a recent article I wrote comparing Quadrics to the Cray X1
>    interconnect relative to the performance of these global address
>    space programming models (UPC and CAF).
>
>    Another thing to look at is the latency advantage/potential of
>    alternative paths to the processor (i.e. HT/InfiniPath).
>
>>
>> Are there any NICs on the market which utilize the 10GBase-CX4 
>> standard, and if there are, are there any clusters which use them? Do 
>> you see it as a viable choice for an interconnect considering the 
>> relatively low cost of InfiniBand and the fact that 10GBase-T is not 
>> that far in the future?
>>
>> When do you estimate that commodity Gigabit NICs with integrated RDMA 
>> support will arrive on the market? (or will they?)
>
>
>   Ammasso already sells one.
>
>>
>>
>> best regards,
>> Olli-Pekka
>
>
>
>Richard Walsh
>AHPCRC
>