[Beowulf] fast interconnects, HT 3.0 ...
Richard Walsh
rbw at ahpcrc.org
Tue May 23 10:14:08 PDT 2006
Eugen Leitl wrote:
> On Tue, May 23, 2006 at 11:04:52AM -0500, Richard Walsh wrote:
>
>
>>> I don't know (like it would stop me), but there are few-port HT switches,
>>> and if there are several ports on one chassis one could wire up
>>> some topology, which, hopefully, will match the problem.
>>>
>>>
>>>
>> HT switches ... ?? ... can you point me to a reference?
>>
>
> Google knows of several, and even some hotplug connector specs.
> See also http://www.commsdesign.com/design_corner/showArticle.jhtml?articleID=16503595
> for a review.
>
>
Thanks.
>>> I'm not sure how this is different from vanilla packet-switched
>>> MPI network. It's not about maintaining memory coherency.
>>>
>>>
>> Well, of course you can run MPI over it as you can on the Cray and
>> Altix, but
>> you are artificially separating memory in software that is in fact
>> closer in hardware.
>>
>
> If I have some 10^3 nodes, and the context is not read-only
> I always have to wait to make sure nobody is trying to write to
> the same location. It's a worst case, but in a relativistic universe
> maintaining the illusion of coherence over many copies is an
> expensive one. Lots of signalling back and forth, until you
> know the state is settled for sure. This might work for 8, 16, maybe 32 systems
> in a close enough location -- but with 10^3 or 10^6 nodes it
> has to give.
>
Mmm ... I do not think we are connecting. Off-board non-coherence is
managed by the application, and is made possible in part by the pGAS
syntax in UPC/CAF. We have some very novel, fine-grained UPC CFD codes
running on the Cray X1 which do indexed adaptive mesh regeneration to
model the flapping wings of a model hummingbird and follow its shedding
vortices. Performance is good and we manage the off-board
incoherence/synchrony nicely. It would have been almost impossible to
write in MPI, and its performance would be poor. The application has
reasonably good scaling properties as it is. It even runs on our
cluster ... yes, in UPC ... (albeit much more slowly). It has some data
locality (not GUPS-like), but the remeshing approach is fine-grained
... the "messages" are direct remote memory puts and gets driven by
vector instructions. HT 3.0 is presumably more elaborate than the
Cray X1 ISA, but it can provide similar, more direct, off-chassis,
non-coherent memory addressing, no? This is in tune with the UPC and
CAF programming models.
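
To make "some data locality (not GUPS-like)" concrete, here is a minimal
sketch in Python, not taken from the CFD code discussed above. It uses the
UPC layout rule that element i of a `shared [B]` array has affinity to
thread (i / B) mod THREADS, and counts how many references in an indexed
gather would be remote gets. The function names and the index list are
hypothetical and chosen only for illustration.

```python
def owner(i, block, threads):
    """Thread with affinity to element i of a UPC-style `shared [block]` array."""
    return (i // block) % threads


def count_remote(indices, my_thread, block, threads):
    """Count indexed references from my_thread that land in another thread's memory."""
    return sum(1 for i in indices if owner(i, block, threads) != my_thread)


# Hypothetical gather issued by thread 0, with 4 threads and block size 4.
indices = [0, 1, 2, 3, 4, 9, 17, 35]
remote = count_remote(indices, my_thread=0, block=4, threads=4)
print(remote, "of", len(indices), "references are remote")
```

In a pGAS language each remote reference in such a gather can compile to
a direct get; in a two-sided model the same access pattern would need
matching sends and receives arranged ahead of time.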
>> That is where the pGAS programming models become more efficient. Remote
>> memory references expressed in the syntax and compiled to
>> instructions for
>> direct puts and gets without management or translation by a NIC. It
>>
>
> We're talking lunatic fringe interconnects where the wire or the fibre
> is your FIFO, and the switch makes a routing decision after a few bits
> of the headers have streamed past -- which is reasonably close to c.
> With 10 GBit data rates and above that's a quick decision to take.
> At 10 GBit/s your serial bit is just ~3 cm or 100 ps short -- in vacuum.
> Shorter in glass, and much shorter in copper. So a very short message
> can arrive within a few ns, which is order of magnitude RAM access.
>
I am talking about improving on the ~1500 ns required by the best of
today's interconnects for a single, remote 8-byte reference, and
perhaps further hiding that reduced latency in a pipelined vector load
operation. The question was: what can HT 3.0 provide non-coherently,
off-board, in this regard? Maybe the answer is nothing ... but I have
not heard it cogently argued yet.
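
For scale, the figures quoted in this exchange can be checked with a few
lines of Python. This is only back-of-the-envelope arithmetic assuming
propagation at c in vacuum (real links in glass or copper are slower);
the helper names are made up for this sketch.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s


def bit_time_s(rate_bits_per_s):
    """Duration of one serial bit at the given line rate."""
    return 1.0 / rate_bits_per_s


def vacuum_distance_m(seconds):
    """Distance covered at c in vacuum during the given time."""
    return C_VACUUM * seconds


bit_t = bit_time_s(10e9)                     # 10 Gbit/s link
bit_len_cm = vacuum_distance_m(bit_t) * 100  # length of one bit in flight
flight_m = vacuum_distance_m(1500e-9)        # distance covered in 1500 ns

print(f"bit time: {bit_t * 1e12:.0f} ps, bit length: {bit_len_cm:.1f} cm")
print(f"1500 ns at c: {flight_m:.0f} m")
```

One bit at 10 Gbit/s lasts 100 ps and spans about 3 cm, while 1500 ns
corresponds to roughly 450 m of flight, so most of today's latency for a
single remote reference is not time spent on the wire.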
>
>> would seem
>> that HT 3.0 supports this model across chassis as long as the
>> programmer manages
>> memory synchronization.
>>
>
> You have to bite the bullet and manage synchronization by higher-order
> protocols. The physical world at the bottom is fundamentally message-passing.
> You might not notice it very much if you're working on the us scale,
> but at ns and below you can't ignore it.
>
OK ... everything is a message ... even a Cray X1 vector write, but I
am comparing MPI messages with something much smaller and more
primitive.
>
>>>> Sounds like the Cray X1E pGAS memory model. Is there a role for
>>>>
>>>>
>>> I don't think there is any other model but message passing. It's not
>>> like this is a ccHT a la HORUS
>>> http://en.wikipedia.org/wiki/HORUS_interconnect
>>>
>>>
>> The inter-chassis, but non-coherent interface that HT 3.0 supports
>> would seem to work
>> very nicely with UPC and CAF. They run very well on the Cray X1,
>> which provides
>> coherent memory on-board only as well.
>>
>
>
>
--
Richard B. Walsh
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org | 612.337.3467
-----------------------------------------------------------------------
This message (including any attachments) may contain proprietary or
privileged information, the use and disclosure of which is legally
restricted. If you have received this message in error please notify
the sender by reply message, do not otherwise distribute it, and delete
this message, with all of its contents, from your files.
-----------------------------------------------------------------------