[Beowulf] Broadcast - not for HPC - or is it?
Bogdan Costescu
bcostescu at gmail.com
Tue Oct 5 06:23:30 PDT 2010
On Fri, Sep 24, 2010 at 12:21 PM, Matt Hurd <matthurd at acm.org> wrote:
> I'm associated with a somewhat stealthy start-up. Only teaser product
> with some details out so far is a type of packet replicator.
From your description as well as from a quick look at the website, it
looks and smells like a hub - I mean a dumb hub, like those which
existed in the '90s before switching hubs (now called switches) took
over. If so, then HPC might not be a good target for you, as it has
long ago adopted switches for good reasons.
> Primarily focused on low-latency
> distribution of market data to multiple users as the port to port
HPC usage is a mixture of point-to-point and collective
communications; most (all?) MPI libraries use low-level point-to-point
communications to implement collective ones over Ethernet. Another
important point is that the collective communications can be started
by any of the nodes - it's not one particular node which generates
data and then spreads it to the others; it's also relatively common
that 2 or more nodes reach the point of collective communication at
the same time, leading to a higher load on the interconnect, maybe
congestion.
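To make "collectives built from point-to-point" concrete, here is a minimal Python sketch of a binomial-tree broadcast schedule. This is one textbook way of doing it, shown as an assumption for illustration; it is not a claim about what any particular MPI library does, and the function name is made up.

```python
def binomial_bcast_schedule(n, root=0):
    """Return a list of rounds; each round is a list of (src, dst) sends.

    Hypothetical helper: models a broadcast among n ranks using only
    point-to-point messages, doubling the set of data holders per round.
    """
    rounds = []
    have = 1  # ranks (numbered relative to root) that already hold the data
    while have < n:
        sends = []
        for rel in range(have):
            peer = rel + have
            if peer < n:
                # Map relative rank numbers back to real rank numbers.
                sends.append(((rel + root) % n, (peer + root) % n))
        rounds.append(sends)
        have *= 2
    return rounds


schedule = binomial_bcast_schedule(8)
for i, sends in enumerate(schedule):
    print(f"round {i}: {sends}")
```

Even in this favourable case the switch carries N-1 separate messages spread over several rounds, and any node can be the root, which is where a single hardware broadcast might help.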
What might be worth a try is a mixed network config where
point-to-point communications go through one NIC connected to a switch
and the collective communications that can use a broadcast go through
another NIC connected to your packet replicator. However, IMHO it
would only make sense if the packet replicator makes some guarantees
about delivery: e.g. that it would accept a packet from node B even if
a packet from node A is being broadcast at that time; this packet
from node B would be broadcast immediately after the previous
transmission has finished. This of course means that each
NIC-to-replicator link needs to be full duplex and some buffering should be
present - this was not the case with the dumb hubs mentioned earlier. I
think that such a setup would be enough for MPI_Barrier and MPI_Bcast.
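The delivery guarantee described above can be sketched as a toy model in Python. The class and method names are hypothetical and say nothing about how the actual product works; the only assumptions are full-duplex links and a FIFO buffer inside the replicator.

```python
from collections import deque


class BufferedReplicator:
    """Toy model: accepts packets from any port at any time and
    rebroadcasts them one at a time, in arrival order."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.queue = deque()  # buffered packets awaiting broadcast
        self.delivered = {p: [] for p in self.ports}

    def submit(self, src, payload):
        # Accepted even if another broadcast is still pending
        # (this is what full duplex plus buffering buys us).
        self.queue.append((src, payload))

    def step(self):
        """Broadcast the oldest queued packet to every port but the sender."""
        if not self.queue:
            return False
        src, payload = self.queue.popleft()
        for p in self.ports:
            if p != src:
                self.delivered[p].append((src, payload))
        return True


rep = BufferedReplicator(["A", "B", "C"])
rep.submit("A", "bcast-1")
rep.submit("B", "bcast-2")  # arrives while A's packet is still pending
while rep.step():
    pass
print(rep.delivered["C"])
```

Every receiver sees the broadcasts in the same order, which is the property a barrier or broadcast implementation would want to rely on.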
One other HPC related application that comes to my mind is distributed
storage. One of the main problems is keeping redundant metadata to
prevent the whole storage going down if one of the metadata servers
goes down. With such a packet replicator, the active metadata server
can broadcast each update to the others; this would be just one operation -
with a switched architecture, this would require N-1 operations (N
being the total number of metadata servers) and would lose any pretence
of atomicity and speed.
> They suggested interest in bigger port counts and mentioned >1000 ports.
Hmmm, if it's only like a dumb hub (no duplex, no buffering), then I
have a hard time imagining how it would work at these port counts -
the number of collisions would be huge...
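A rough back-of-the-envelope model of why this worries me, as a Python sketch. The assumptions are mine and very simplified: time is divided into slots, and each of n ports independently transmits in a given slot with probability p; a collision is any slot with two or more transmitters. Real shared-medium behaviour (carrier sense, backoff) is more complicated.

```python
def collision_probability(n, p):
    """Probability that two or more of n ports transmit in the same slot,
    assuming each transmits independently with probability p."""
    idle = (1 - p) ** n                   # nobody transmits
    single = n * p * (1 - p) ** (n - 1)   # exactly one port transmits
    return 1 - idle - single


for n in (8, 64, 1000):
    print(n, collision_probability(n, 0.01))
```

Under these assumptions, even a 1% per-port transmit probability makes almost every slot a collision at 1000 ports.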
Cheers,
Bogdan