[Beowulf] Re: Re: Home beowulf - NIC latencies

Patrick Geoffray patrick at myri.com
Wed Feb 16 02:07:27 PST 2005

Joachim Worringen wrote:
> AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard 
> (partly naive) collective operations of MPICH. Considering this, plus 
> the fact

Replacing the collectives from MPICH-1 was not high on the todo list 
because there were more important things to optimize, with more effect 
on applications than the scheduling of some collectives. For scaling 
real codes on large machines, your priority is not there: not enough 
bang for your time.

> - that it's not all that hard to use GM for pt-2-pt efficiently. We have 
> done this in our MPI, too, with the same level of performance.

You have no idea, then, how hard it is to use GM efficiently and 
*correctly*. Enough to run pingpong? Sure, that's a piece of cake. But 
how do you recover from fatal errors on the wire, or from resource 
exhaustion? How do you avoid spending most of your time 
pinning/unpinning pages, or thrashing the translation cache on the NIC? 
Did you address all of these issues in your MPI? Maybe, but it requires 
design decisions made above the device layer. At some point you have to 
make choices, and in a Swiss-Army-Knife (SAK) implementation, you choose 
the common ground, or the existing ground.

> - that you probably do not know anything on ScaMPI's current internal 

True, I know zip about ScaMPI's design. That is exactly why I don't know 
how they use GM. Without knowing that, how can you infer hardware 
characteristics from benchmark results?!?

> design (Intel is MPICH2 plus some Intel-propietary device hacking) and 
> little about it's performance (if this is wrong, let us know)

Intel MPI is MPICH2 plus some multi-device glue. Intel got something 
right in their design: they ask the vendor to provide the native device 
layers instead of doing everything themselves. That's how a SAK 
implementation could actually be decent. However, the reference 
implementation uses uDAPL. That means there is stuff above the device 
layer that is needed to make the MPI-over-uDAPL performance decent. Some 
of it can be reused for other devices, the rest cannot. The question is: 
if I need something above the device layer to make my stuff decent, 
could I have it? I would think so. Now, if it conflicts with something 
needed for another device, what happens? Someone makes a choice.

> - that all code apart from the device, and also the device architecture 
>  of MPICH-GM are more or less 10-year-old swiss-army-knive MPICH code 
> (which is not a bad thing per se)

MPICH-1 is not a SAK. You cannot take an MPICH binary and run it on all 
of the devices to which MPICH has been ported. You can *compile* it for 
multiple targets, but nothing more.

Furthermore, many ch2 things were not used in ch_gm. If you look at it, 
most of the common code of MPICH is not performance related, with the 
exception of the collectives (and again, they are not that bad). MPICH-2 
has been moving more things into the device-specific part; that's a good 
direction.

> you should maybe think again before judging on the efficiency of other 
> MPI implementations.

I could not care less about the efficiency of other MPI implementations. 
None of my business. My point is this: assuming that using a SAK MPI 
implementation factors out the software part, so that all remaining 
performance differences are hardware related, is ridiculous. As Greg 
pointed out, an interconnect is a software/hardware stack, all the way 
up to the MPI lib. Throw away the native MPI lib and you have a lame 
duck. Compare lame ducks and you go nowhere.

When you have a commercial MPI, you don't have much choice but to 
support many interconnects. You cannot ask the vendors to write their 
part unless you are Intel, so you write it yourself. You do your best, 
because you need to sell your stuff, and you call it good. Is there 
value in it? Today, yes, because binary compatibility makes life easier. 
However, my second point is that binary compatibility should be 
addressed by the MPI community, not by commercial MPI vendors.

Patrick Geoffray
Myricom, Inc.
