[Beowulf] Correct networking solution for 16-core nodes

Joachim Worringen joachim at dolphinics.com
Thu Aug 3 02:19:44 PDT 2006


Tahir Malas wrote:
>> -----Original Message-----
>> From: Vincent Diepeveen [mailto:diep at xs4all.nl]
[...]
>> Quadrics can work, for example, as direct shared memory among all
>> nodes when you program for its shmem, which means that for short
>> messages you can simply share something from the 64MB RAM on the card
>> and make that array visible on all nodes. You just write into this
>> array normally and the cards take care that it gets synchronised.
>>
> That is how LAM_MPI handles short messages within an SMP node, isn't it? But
> we don't change anything in the MPI routines; if the message is short, it is
> handled via shared memory.

This comparison is a little bit simplistic. With a shared-memory based 
MPI, data may well be transferred with one copy from the send buffer to a 
shared buffer, and then with another copy into the receive buffer - if the 
message is expected. Otherwise, an additional copy into a temporary buffer 
is required. This never happens with the SHMEM API. Also, MPI adds message 
headers and performs message matching, neither of which is required for 
SHMEM.
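
To make the difference concrete, here is a minimal sketch of a one-sided
put. I am using OpenSHMEM-style calls (shmem_init, shmem_malloc,
shmem_double_put) purely as an illustration - the Quadrics SHMEM library
differs in its allocation and synchronisation details:

  /* Minimal one-sided put between two PEs, OpenSHMEM-style API.
   * Illustration only - the Quadrics SHMEM library differs in detail. */
  #include <stdio.h>
  #include <shmem.h>

  #define N 1024

  int main(void)
  {
      shmem_init();
      int me   = shmem_my_pe();
      int npes = shmem_n_pes();

      /* Symmetric allocation: the same buffer exists on every PE. */
      double *buf = shmem_malloc(N * sizeof(double));

      if (me == 0 && npes > 1) {
          for (int i = 0; i < N; i++)
              buf[i] = (double)i;
          /* One-sided write into PE 1's copy of buf: no message header,
           * no matching, no intermediate copy in the API model. */
          shmem_double_put(buf, buf, N, 1);
      }
      shmem_barrier_all();   /* make the remote data visible before use */

      if (me == 1)
          printf("PE 1 got buf[N-1] = %g\n", buf[N - 1]);

      shmem_finalize();
      return 0;
  }

Compare this with MPI_Send/MPI_Recv, where the library has to match the
message by (source, tag, communicator) and, for an unexpected message,
stage the data in a temporary buffer first.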

> For IB of Voltaire, for example, there are some options: single port vs.
> dual port, MEMfree vs. 128-256 MB RAM on the card (which sounds similar to
> Quadrics), and, more importantly, the bus interface: PCI-X, PCI-E, or AMD
> HyperTransport (HTX). HTX is said to provide 1.3us latency by connecting
> directly to the AMD Opteron processor via a standard HyperTransport HTX slot.
> Does a HyperTransport HTX slot mean the RAM slots on the mainboard? Do we
> then have to sacrifice some slots for the NIC? Well, in the end it is still
> unclear to me which one, and how many, to choose.

No, you use dedicated HTX slots for the NIC. HTX is not found on the 
majority of Opteron mainboards, but a number of HTX server boards do 
exist.

From the numbers published by Pathscale, it seems that the simple MPI 
latency of Infinipath is about the same whether you go via PCIe or HTX. 
The application performance might be different, though.

>> Other than that latency, you have to realize that the latency of those
>> cards is still ugly compared to the latencies within the quad boxes.
>>
>> If you have 8 threads running or so in those boxes and you use an IB card,
>> then it'll have a switch latency.
>>
>> Only Quadrics is clear about its switch latency (probably its competitors
>> have a worse one). It's 50 us for 1 card.

Where did you find these numbers? Such a huge delay should be easy to 
measure with a simple MPI benchmark, e.g. Pathscale's "mpi_multibw".
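
For reference, a simple ping-pong like the following (my own sketch, not
mpi_multibw itself) would already show a per-message delay of that size:

  /* Minimal MPI ping-pong latency sketch between ranks 0 and 1.
   * Illustration only - not Pathscale's mpi_multibw. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      const int iters = 10000;
      char byte = 0;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();

      for (int i = 0; i < iters; i++) {
          if (rank == 0) {
              MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }

      double t1 = MPI_Wtime();
      if (rank == 0)
          /* Half the round-trip time is the one-way latency. */
          printf("one-way latency: %.2f us\n",
                 (t1 - t0) / (2.0 * iters) * 1e6);

      MPI_Finalize();
      return 0;
  }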

> But if we directly connect two boxes without a switch, then we can achieve
> this latency I hope?

No, the described latency is a node-internal latency.

  Joachim

-- 
Joachim Worringen, Software Architect, Dolphin Interconnect Solutions
phone ++49/(0)228/324 08 17 - http://www.dolphinics.com


