[Beowulf] New HPCC results, and an MX question

Wed Jul 20 18:49:09 PDT 2005

Vincent Diepeveen wrote:
>>>There likely will be a difference, because average pingpong doesn't
>>>run on all the cpus. On a 4-cpu node, that can make a big difference.
>>
>>I believe the difference will not be that big. I will get my hands on a 
>>quad in the next couple of weeks, I will look into it.
> 
> 
> The difference will be huge of course, network processors have a switch
> latency. That's why.
> 
> If it must switch at the wrong moment that'll cost 50 us or something at
> certain network chips.

Switch latency is negligible in this problem, and in any event 50us is
not a realistic switch latency with modern hardware.

The real question is the following: do 4 processes running on 4
different CPUs greatly affect the latency of small messages sent to
other nodes, compared to only one process running on one CPU?

The answer, I argue, is "not much". Assuming that all processes send at
the exact same time, access to the PCI bus will be serialized, NIC
processing will be serialized and access to the wire will be serialized.
The most expensive resource in this pipeline for 0-byte messages is
likely to be the NIC. So, it boils down to the NIC overhead per send (or
recv), and that is not big with MX (and will be further reduced in the
future). In any event, it is not on the order of 10us. With GM, it's a
different story as it does not do PIO for small messages.
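
To make the serialization argument concrete, here is a toy model of that
pipeline. It is only a sketch: the function name completion_times and the
three stage costs are hypothetical placeholders chosen for illustration,
not measured MX, PCI or wire numbers. It assumes every send is posted at
t=0 and that each stage (PCI access, NIC processing, wire) handles one
send at a time.

```python
def completion_times(n_senders, stage_costs_us):
    """Return the time (in us) at which each send leaves the last stage.

    Toy model: all sends are posted at t=0 and pass through the same
    stages in order; each stage handles one send at a time.
    """
    free_at = [0.0] * len(stage_costs_us)  # when each stage is next idle
    done = []
    for _ in range(n_senders):
        t = 0.0  # this send is ready at t=0
        for stage, cost in enumerate(stage_costs_us):
            start = max(t, free_at[stage])  # wait if the stage is busy
            t = start + cost
            free_at[stage] = t
        done.append(t)
    return done


# Placeholder costs in microseconds for PCI access, NIC processing, wire.
# These are illustrative values, NOT measured MX numbers.
stages = [0.2, 0.5, 0.1]
times = completion_times(4, stages)
extra = times[-1] - times[0]  # added delay seen by the last of 4 senders
print(times, extra)
```

In this model the extra delay seen by the last of n simultaneous senders
is (n-1) times the cost of the slowest stage, so with 4 processes it
stays small as long as the per-send NIC overhead is well under a
microsecond.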

> Additional there will be software layers that have to lock in some way.

You don't have to lock when doing os-bypass. At least, you don't have to
lock against other processes (which is kinda expensive). We take a spinlock
because we have at least one other thread in the lib. The gain of having
such a thread outweighs the cost of the spinlock, no question about that.

> Locking +  unlocking is already like half a microsecond extra, just like that.

Taking a spinlock on Opteron is ~50 ns. On Xeon or Nocona, it's a bit
more (~150 ns).

> Tests at all processors at the same time make major sense.

Yes and no. Most networking people believe the job of a node is to send
messages. Actually, it's mainly to compute, and sometimes to send
messages. So, would running a pingpong test on multiple processors at
the same time, all sharing one NIC, be an interesting benchmark? Not
really, it won't happen much in real codes that compute most of the time.
I prefer to optimize other things that help the host compute faster.

> Any denial in advance that it will be the same speed is just ballony.

And I thought I was the bulliest on this list...

I just give my opinion and at least my opinion is backed up by 
first-hand experience. I don't know how to play chess, but I know my stuff.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com