[Beowulf] Performance characterising a HPC application

Thu Mar 29 08:08:39 PDT 2007

On Thu, 29 Mar 2007, Richard Walsh wrote:

> Hey Patrick,
> 
> Patrick Geoffray wrote:
> > Message aggregation would be much more beneficial in the context of
> > UPC, where the compiler will likely generate many small-grain
> > communications to the same remote process. However, as Greg pointed
> > out, MPI applications often already aggregate messages themselves
> > (this is something even "challenged" application people understand
> > easily). 
> Right ...
> > I would bet that UPC could more efficiently leverage a strided or
> > vector communication primitive instead of message aggregation. I don't
> > know if GASNet provides one; I know ARMCI does. 
> Having the alternative of compiling to a pseudo-vector pipelined
> GASNet/ARMCI primitive (hiding latency in the pipeline) >or< an
> aggregation primitive (amortizing latency over a large data block) would
> seem to be a good thing, depending on the number and distribution of
> remote memory references in your given kernel.  The latency bubbles in
> interconnect-mediated remote memory references are of course much
> larger than in a hardware-mediated global address space remote
> reference.  This might change the effectiveness of pipeline-based
> latency hiding.  From what I know, the Berkeley UPC
> compiler team is focused on optimization through aggregation rather than
> pipelining.

Hi Richard,

ARMCI, by way of Global Arrays, makes it pretty clear to
performance-minded end users that vector pipelining is the
optimization that they leverage by exposing the dimensionality and
size of in-memory references -- that makes the model more explicit
than UPC, but much less so than MPI.  With UPC, as I'm sure you know,
shared pointers contain some information about data layout but that
information can be lost depending on how the shared pointer is
manipulated, and can end up being something the compiler has to (or
can't) figure out.  The fact that UPC has a relaxed shared memory
model helps with prefetching, caching, aggregation and any other
latency-hiding technique, all so vital on systems with
orders-of-magnitude latency differences between local and global
memory.  It's a difficult optimization problem since some of these
optimizations happen at compile time and others depend on the
runtime, and the two sides shouldn't step on each other's toes.
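
To make "exposing the dimensionality and size of in-memory
references" concrete, here is a minimal sketch of a strided
descriptor for a 2-D patch of a row-major array.  The function names
and the descriptor layout are hypothetical, made up for illustration;
this is not the ARMCI or Global Arrays API.

```python
# Hypothetical strided descriptor: four integers describe a whole 2-D patch,
# so a runtime could pipeline the rows instead of discovering them one by one.

def subblock_descriptor(row0, col0, nrows, ncols, row_len):
    """Describe an nrows x ncols patch of a row-major array of width row_len."""
    return {
        "base": row0 * row_len + col0,  # offset of the first element
        "stride": row_len,              # distance between starts of rows
        "count": nrows,                 # number of contiguous runs
        "block_len": ncols,             # elements per contiguous run
    }


def strided_offsets(desc):
    """Expand a descriptor into the flat element offsets it covers."""
    return [
        desc["base"] + r * desc["stride"] + c
        for r in range(desc["count"])
        for c in range(desc["block_len"])
    ]


if __name__ == "__main__":
    desc = subblock_descriptor(row0=1, col0=2, nrows=2, ncols=3, row_len=8)
    print(desc)
    print(strided_offsets(desc))
```

The point of the sketch is only that the descriptor stays the same
size no matter how large the patch is, whereas a pointer-by-pointer
view leaves the layout for the compiler or runtime to rediscover.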

For MPI, if it is to be "the assembly language of parallel
computing", it is harder to justify the use of implicit message
aggregation.  (Just to be clear, by assembly here I think the author
of this quote meant the "micro-ops" in your favorite processor, not
the whole risc-vs-cisc-that-is-really-a-risc debate.)

Having developed parallel applications using both explicit and
implicit programming models, I find that one of the most useful
things to know in both is when and how communication happens.
"Communication leakage", or unintended communication, is what makes
this task more difficult in implicit languages.  The "how" part is
easier as a performance-oriented task in MPI, whereas the "when" is
mostly left as an optimization for the MPI implementation to figure
out (although the programmer can better control the "when" with
synchronous primitives and blocking calls).

I would find it surprising if existing MPI codes really benefited
from aggregation, since users have had to be mindful of where and how
communication happens as a performance concern for decades now.
Also, there's the longstanding unwritten MPI rule that "larger
messages are better", which weakens the case for message aggregation
as a latency-hiding technique.  Regardless, I can accept the case
for message aggregation in MPI, but it shouldn't be a de facto
component of an MPI implementation -- it should be an on/off switch
for developers, and it shouldn't be on by default in benchmarking
modes (then again, measuring latency with small message algorithms
that only scale up to 16 nodes shouldn't be on by default for
benchmarking modes either, but that's a different issue).
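
The "larger messages are better" rule can be illustrated with the
usual latency-plus-bandwidth cost model.  This is a rough sketch
only: the 5 microsecond latency and 1 GB/s bandwidth are assumed
numbers, the function names are made up, and the model ignores any
overlap or pipelining an implementation might do.

```python
# Simple latency/bandwidth cost model; all parameters are hypothetical.

def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Time for one message: fixed latency plus serialization time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def separate_sends(count, nbytes_each, latency_s, bandwidth_bytes_per_s):
    """Cost of `count` back-to-back messages, each paying the full latency."""
    return count * message_time(nbytes_each, latency_s, bandwidth_bytes_per_s)


def aggregated_send(count, nbytes_each, latency_s, bandwidth_bytes_per_s):
    """Cost of one message carrying all `count` payloads."""
    return message_time(count * nbytes_each, latency_s, bandwidth_bytes_per_s)


if __name__ == "__main__":
    latency = 5e-6    # assumed per-message latency, in seconds
    bandwidth = 1e9   # assumed bandwidth, in bytes per second
    for size in (8, 1024, 1 << 20):
        sep = separate_sends(100, size, latency, bandwidth)
        agg = aggregated_send(100, size, latency, bandwidth)
        print(f"{size:>8} B x100: separate {sep:.6f} s, aggregated {agg:.6f} s")
```

In this model aggregation saves exactly (count - 1) latencies, so it
matters a great deal for 8-byte messages and hardly at all once the
application is already sending large messages, which is the situation
most tuned MPI codes are in.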

MPI implementations are horrible beasts to maintain but are beautiful
in some regards: they flourish in many directions, and I wouldn't
stand in the way of yet another performance-minded feature.  But if
MPI is to remain the reference explicit model, implementations should
be explicit about what's going on under the covers when a
programmer's expectation of *how* two ranks communicate is seriously
affected -- I think aggregation in this case qualifies as serious.

>             Regardless, at this point our more GUPS-like direct remote
> memory reference patterns in our UPC codes, which perform well on the
> Cray X1E, must be manually aggregated to achieve performance on a cluster.

I also butted my head pretty hard against this problem.  For UPC,
part of it has to do with 'C' and how difficult it is to assert that
loops are free of dependencies (the typical aliasing problems in C,
even in serial code).  At the "runtime systems" level, the X1E's lack
of flexible inline assembly made it difficult to construct (c.f. name)
the scatter/gather ops that are so vital to its performance.  But
then again, one could say that massaging codes to be vector friendly
has been part of the bargain on vectors for decades now.
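
For readers who haven't done it, the manual aggregation of GUPS-like
references mentioned above amounts to bucketing updates by owning
rank before communicating.  The following is a single-process
simulation under stated assumptions: the table is block-distributed,
the "messages" are only counted rather than sent, and all names are
hypothetical.  It is not UPC or MPI code.

```python
from collections import defaultdict


def owner_of(index, elems_per_rank):
    """Rank that owns `index` in a block-distributed table (assumed layout)."""
    return index // elems_per_rank


def fine_grained_updates(table, indices, elems_per_rank, my_rank):
    """One simulated remote reference per non-local update; returns message count."""
    messages = 0
    for i in indices:
        if owner_of(i, elems_per_rank) != my_rank:
            messages += 1
        table[i] ^= i  # GUPS-style XOR update
    return messages


def aggregated_updates(table, indices, elems_per_rank, my_rank):
    """Bucket updates by owner, then one simulated message per remote owner."""
    buckets = defaultdict(list)
    for i in indices:
        buckets[owner_of(i, elems_per_rank)].append(i)
    messages = 0
    for rank, batch in buckets.items():
        if rank != my_rank:
            messages += 1
        for i in batch:
            table[i] ^= i  # the owner applies the whole batch
    return messages


if __name__ == "__main__":
    indices = [1, 5, 6, 13, 5, 2, 14]
    a, b = [0] * 16, [0] * 16
    print(fine_grained_updates(a, indices, 4, 0),
          aggregated_updates(b, indices, 4, 0),
          a == b)
```

Bucketing reorders the updates, which is harmless here only because
XOR updates commute; in general this is the kind of dependence
question a compiler has trouble answering on its own.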

    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)