[Beowulf] Questions regarding interconnects

Thu Mar 24 21:47:45 PST 2005

> > What do you see as the key differentiating factors in the quality of an 
> > MPI implementation? Thus far I have come up with the following:
> > -Completeness of the implementation

this really depends on the "maturity" of the application.  I know of one
application which has covered a lot of ground, including cray-shm, openmp
and mpi-2 (with heavy use of post-mpi-1 features.)  it cares about 
completeness, but a new app written from scratch doesn't.

> > -Latency/bandwidth

it would be hard to argue that these don't matter.  as Greg points 
out, zero-byte latency and infinite-byte bandwidth don't necessarily
predict the performance that real-app-sized packets will see.
then again, if a more accurate prediction were desired, just fit
three lines to the s-curve.  that's the appeal of quoting half-bandwidth
packet size anyway, isn't it?
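
for what it's worth, here is a minimal sketch of the simplest version of that
idea: a single straight-line cost model, t(n) = t0 + n/bw, rather than a real
three-segment fit.  the model and the numbers (5 us, 1 GB/s) are assumptions
made up for illustration, not measurements of any interconnect.

```python
# Linear cost model (an assumption, not a measurement): t(n) = t0 + n / bw

def transfer_time(nbytes, t0, bw):
    """Seconds to move nbytes, given zero-byte latency t0 (s) and
    asymptotic bandwidth bw (bytes/s)."""
    return t0 + nbytes / bw

def effective_bandwidth(nbytes, t0, bw):
    """Bandwidth actually seen by a message of nbytes (bytes/s)."""
    return nbytes / transfer_time(nbytes, t0, bw)

def half_bandwidth_size(t0, bw):
    """Message size at which effective bandwidth is bw/2.
    Under the linear model this is simply t0 * bw."""
    return t0 * bw

if __name__ == "__main__":
    t0, bw = 5e-6, 1e9          # made-up: 5 us latency, 1 GB/s peak
    n_half = round(half_bandwidth_size(t0, bw))
    for n in (64, 1024, n_half, 1 << 20):
        frac = effective_bandwidth(n, t0, bw) / bw
        print(f"{n:>8} bytes: {frac:.2f} of peak")
```

under this model the half-bandwidth size is one number that folds latency and
bandwidth together; small packets sit far below peak no matter how good the
infinite-byte figure looks.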

> > -Asynchronous communication

this appeals because people recognize that it *could* provide 
higher performance.  it seems like most implementations are fairly
disappointing in how they implement asynchrony, but that's not a 
reason to ignore asynchrony present in your program.

> > -Smart collective communication

this would appeal more widely if there were hardware support that gave
a real speedup (as in Quadrics) rather than shifting code from app-space
to library-space.  in other words, people care less about convenience 
functions.  libmpi.a may do a wonderful O(log n) bcast, but it would 
be a lot sexier if the interconnect provided hardware acceleration.
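
to make the library-space version concrete, here is a sketch of a binomial-tree
broadcast schedule, one common way a software bcast is built.  it is not taken
from any particular MPI library; the function name and the (sender, receiver)
representation are made up for illustration.  the tree finishes in
ceil(log2(n)) rounds and sends n-1 messages in total.

```python
def binomial_bcast_schedule(nranks, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver)
    pairs.  Ranks are numbered relative to the root, and in every round
    each rank that already holds the data sends to one rank that doesn't."""
    rounds = []
    have = 1                      # how many (relative) ranks hold the data
    while have < nranks:
        pairs = []
        for rel in range(have):
            peer = rel + have
            if peer < nranks:
                pairs.append(((rel + root) % nranks, (peer + root) % nranks))
        rounds.append(pairs)
        have *= 2                 # holders double every round
    return rounds

if __name__ == "__main__":
    for step, pairs in enumerate(binomial_bcast_schedule(8)):
        print(step, pairs)
```

all of this is ordinary point-to-point traffic that the app could have issued
itself, which is the sense in which it only moves code from app-space to
library-space.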

> Likewise, people want asynchronous communication because they imagine
> that it will give them better performance.

I think there's more to it than that.  any programmer notices when 
there are dependencies and when there is slack.  if there were a smart
MPI/interconnect coprocessor, taking advantage of the slack would 
turn asynchrony into better performance - basically latency hiding.
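
a back-of-the-envelope model of that, as a sketch: split the work into a
transfer, compute that doesn't depend on the transfer (the slack), and compute
that does.  the `overlap` parameter is an invented knob for how much of the
transfer really progresses without the host; 0.0 stands in for a library that
only makes progress inside MPI calls, 1.0 for an ideal coprocessor.

```python
def blocking_time(t_comm, t_independent, t_dependent):
    """Everything serialized: wait for the transfer, then do all the compute."""
    return t_comm + t_independent + t_dependent

def overlapped_time(t_comm, t_independent, t_dependent, overlap=1.0):
    """Start the transfer, do the independent compute, then wait.
    overlap is the fraction of the transfer (0..1) that can progress
    while the host is busy computing (a modelling assumption)."""
    hidden = min(t_comm * overlap, t_independent)   # time hidden in the slack
    return t_comm + t_independent + t_dependent - hidden

if __name__ == "__main__":
    # made-up times, arbitrary units
    print(blocking_time(10, 30, 5))
    print(overlapped_time(10, 30, 5))               # ideal overlap
    print(overlapped_time(10, 30, 5, overlap=0.0))  # no real asynchrony
```

with full overlap the total is max(t_comm, t_independent) + t_dependent; with
none, the asynchronous code costs exactly what the blocking code does.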

> > When do you estimate that commodity Gigabit NICs with integrated RDMA 
> > support will arrive to the market? (or will they?)
> 
> They arrived a while ago, didn't seem to make much of a splash. I don't
> personally think much of offload.

TOE folk don't seem to understand the concept of fast-paths.
sure, RDMA is attractive, but does that mean the whole TCP stack
(plus some new extra RDMA gunk) needs to go onto the nic?
suppose you had a nic which could generate packets in response to 
very specific filters on incoming packets.  in other words, "reflex" 
responses to the expected state transitions, avoiding host involvement
if the pattern is as expected.
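
a toy sketch of what I mean, purely hypothetical: the class, the dict-shaped
packets and the field names ("conn", "seq", "kind", "len") are all invented
here and don't describe any real nic or driver interface.  the nic holds a
small table of filters; a match produces a canned response with no host
involvement, and anything else falls through to the host.

```python
class ReflexNic:
    """Toy model of a nic with a filter table consulted before the host."""

    def __init__(self):
        self.filters = []      # list of (predicate, build_response)
        self.host_queue = []   # packets that fell off the fast path

    def add_filter(self, predicate, build_response):
        self.filters.append((predicate, build_response))

    def receive(self, packet):
        for predicate, build_response in self.filters:
            if predicate(packet):
                return build_response(packet)   # reflex reply, host not involved
        self.host_queue.append(packet)          # slow path: hand to the host
        return None

def expect_in_order_data(conn, seq):
    """Filter for the one state transition we expect next on a connection."""
    def predicate(p):
        return (p.get("conn") == conn and p.get("kind") == "data"
                and p.get("seq") == seq)
    def build_ack(p):
        return {"conn": conn, "kind": "ack", "ack": p["seq"] + p["len"]}
    return predicate, build_ack

if __name__ == "__main__":
    nic = ReflexNic()
    nic.add_filter(*expect_in_order_data(conn=7, seq=100))
    print(nic.receive({"conn": 7, "kind": "data", "seq": 100, "len": 1024}))
    print(nic.receive({"conn": 7, "kind": "data", "seq": 9999, "len": 1024}))
    print(len(nic.host_queue))
```

the point is that only the expected pattern needs to live on the nic; the rest
of the stack stays on the host.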

of course, it's also true that TCP has very little justification
in a cluster setting, so what's TOE for?  trying to run really giant
webservers on a single K6-2?  most internet-related TCP services 
can be quite readily clusterized in the first place, so scaling 
is not a problem.

one could easily argue that network state machines have shown far 
less innovation and paradigm shift than graphics accelerators.
and look at the awesome amount of offload in your video card - 
it could easily have more transistors and flops than your host cpu.
as far as I can tell, this argument only fails because the mass 
market is not anywhere close to being net-bottlenecked, and because
it's harder to throw hardware at networking.  it's easy to be limited
by graphics (turn up the resolution, framerate, quality, AA, etc),
and it's easy to throw another dozen pixel pipelines at the problem.

imagine if you had an interconnect coprocessor with 220M transistors
and 30 GB/s private memory bandwidth sitting on 16x PCI-E.  the only 
thing I can think of to use that horsepower for would be a distributed
directory-based shared-memory scheme that implemented FP collectives...