[Beowulf] Re: Re: Home beowulf - NIC latencies
Patrick Geoffray
patrick at myri.com
Wed Feb 16 02:07:27 PST 2005
Joachim Worringen wrote:
> AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard
> (partly naive) collective operations of MPICH. Considering this, plus
> the fact
Replacing the collectives from MPICH-1 was not high on the todo list
because there were more important things to optimize, with more effect
on applications than the scheduling of some collectives. For scaling
real codes on large machines, your priority is not there; not enough
bang for your time.
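To make "scheduling of a collective" concrete: a flat broadcast is P-1
sequential sends from the root, while a binomial tree finishes in about
log2(P) rounds. This is just a generic sketch over MPI point-to-point
(the tree version assumes root 0 for brevity), not MPICH's actual code:

/* Illustration only: the kind of scheduling difference discussed above. */
#include <mpi.h>

/* Flat (naive) broadcast: the root sends to everyone, one after the other. */
static void bcast_flat(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, dst;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (dst = 0; dst < size; dst++)
            if (dst != root)
                MPI_Send(buf, count, type, dst, 0, comm);
    } else {
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

/* Binomial-tree broadcast (root 0): each rank that already has the data
 * forwards it, so the whole thing takes about log2(P) steps instead of P-1. */
static void bcast_binomial(void *buf, int count, MPI_Datatype type,
                           MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive phase: find the bit at which this rank gets the data. */
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Forward phase: relay to the ranks further down the tree. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}

That gap matters, but on real applications it is usually smaller than
what you gain by getting the point-to-point path right, which was the
point above.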
> - that it's not all that hard to use GM for pt-2-pt efficiently. We have
> done this in our MPI, too, with the same level of performance.
Then you have no idea how hard it is to use GM efficiently and *correctly*.
Enough to run pingpong? Sure, that's a piece of cake. But how do you
recover from fatal errors on the wire, or from resource exhaustion? How
do you avoid spending most of your time pinning/unpinning pages, or
trashing the translation cache on the NIC? Did you address all of these
issues in your MPI? Maybe, but it requires design decisions made above
the device layer. At some point you have to make choices, and in a
Swiss-Army-Knife (SAK) implementation, you choose the common ground, or
the existing ground.
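To give an idea of the machinery involved, here is a minimal sketch of a
registration ("pin") cache, the usual way to avoid paying the pin/unpin
cost on every message. pin_region()/unpin_region() are hypothetical
stand-ins for the interconnect's registration calls (e.g. GM's memory
registration), and the cache itself is generic, not Myricom code. The
real thing also has to invalidate entries when the application frees or
remaps memory, which is exactly where "correctly" gets hard:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: pin the pages and map them on the NIC, and undo it. */
extern int pin_region(void *addr, size_t len);
extern int unpin_region(void *addr, size_t len);

#define CACHE_SLOTS 64

struct pin_entry {
    uintptr_t addr;
    size_t    len;
    int       valid;
};

static struct pin_entry cache[CACHE_SLOTS];

/* Returns 0 if [addr, addr+len) is pinned (cached or newly pinned),
 * nonzero on failure.  Eviction is a dumb direct-mapped policy; a real
 * MPI would also hook deallocation to flush stale entries. */
int pin_cached(void *addr, size_t len)
{
    uintptr_t a = (uintptr_t)addr;
    unsigned slot = (unsigned)((a >> 12) % CACHE_SLOTS);
    struct pin_entry *e = &cache[slot];

    if (e->valid && a >= e->addr && a + len <= e->addr + e->len)
        return 0;                              /* hit: no pin syscall */

    if (e->valid)
        unpin_region((void *)e->addr, e->len); /* evict old mapping */

    if (pin_region(addr, len) != 0) {
        e->valid = 0;
        return -1;
    }
    e->addr  = a;
    e->len   = len;
    e->valid = 1;
    return 0;
}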
> - that you probably do not know anything on ScaMPI's current internal
True, I know zip about the ScaMPI design. That is exactly why I don't know
how they use GM. Without knowing that, how can you infer hardware
characteristics from benchmark results?!
> design (Intel is MPICH2 plus some Intel-propietary device hacking) and
> little about it's performance (if this is wrong, let us know)
Intel MPI is MPICH2 plus some multi-device glue. Intel got something
right in their design: they ask the vendors to provide the native device
layers instead of doing everything themselves. That's how a SAK
implementation could actually be decent. However, the reference
implementation uses uDAPL. That means there is stuff above the device
layer that is needed to make MPI-over-uDAPL performance decent. Some of
it can be reused for other devices, the rest cannot. The question is: if
I need something above the device layer to make my stuff decent, could I
have it? I would think so. Now, if it conflicts with something needed
for another device, what happens? Someone makes a choice.
> - that all code apart from the device, and also the device architecture
> of MPICH-GM are more or less 10-year-old swiss-army-knive MPICH code
> (which is not a bad thing per se)
MPICH-1 is not a SAK. You cannot take an MPICH binary and run it on all
of the devices on which MPICH has been ported. You can *compile* it on
multiple targets, but nothing more.
Furthermore, many ch2 things were not used in ch_gm. If you look at it,
most of the common code of MPICH is not performance related, with the
exception of the collectives (and again, they are not that bad). MPICH-2
has been moving more things into the device-specific part; that's the
right direction.
> you should maybe think again before judging on the efficiency of other
> MPI implementations.
I could not care less about the efficiency of other MPI implementations.
None of my business. My point is that assuming that a SAK MPI
implementation factors out the software part, so that all remaining
performance differences are hardware related, is ridiculous. As Greg
pointed out, an interconnect is a software/hardware stack, all the way
up to the MPI lib. Throw away the native MPI lib and you have a lame
duck. Compare lame ducks and you go nowhere.
When you have a commercial MPI, you don't have much choice but to
support many interconnects. You cannot ask the vendors to write their
part unless you are Intel, so you write it yourself. You do your best,
because you need to sell your stuff, and you call it good. Is there
value? Today, yes, because binary compatibility makes life easier.
However, my second point is that binary compatibility should be
addressed by the MPI community, not by commercial MPI implementations.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com