From e.scott.atchley at gmail.com  Mon Mar  2 14:08:31 2026
From: e.scott.atchley at gmail.com (Scott Atchley)
Date: Mon, 2 Mar 2026 09:08:31 -0500
Subject: [Beowulf] [EXTERNAL] IB vs. Ethernet
In-Reply-To: <8A12224D-EDAA-4A1C-96FC-28B5DACCCB31@serissa.com>
References: <1f8b8ac8-2ff7-4496-8d68-a6356da7b342@ucar.edu>
 <20260221082852.GA10552@rd.bx9.net>
 <83B6174F-BAA2-4F44-BA3B-BEA56CA07044@serissa.com>
 <8A12224D-EDAA-4A1C-96FC-28B5DACCCB31@serissa.com>
Message-ID:

On Wed, Feb 25, 2026 at 9:04 PM Lawrence Stewart wrote:

> Arista has published 10G latency measurements for QSFP based copper
> and optical links from 1-6 meters
>
> Copper latency looks like about 5 ns per meter while optical is a
> little slower for short cables and a little faster for long ones.
>
>
> For 400G link modules, apparently you can use "analog" optical
> transceivers with 20 ns delays plus fiber delay up to 100 meters. You
> can also use DSP based ones that could be 100 ns
>
> The Optical Analog/Clock and Data Recovery cables are much lower
> latency than the Active Optical Cables with retimers in them and
> perhaps equalizers.
>
> For connections within a rack, you can also use Direct Attach Copper,
> which is just a twinax parallel cable, up to about 5 meters. Or there
> are Active Electrical Cables with equalizers that are a bit slower.
>
> The price tags for the optical 400G cables are eye-popping.
>
> I realize that most AI work is bandwidth-focussed, and a microsecond
> is fine, but I have a soft spot for SHMEM 8 byte puts and gets, and
> there is always a role for Barrier and small AllGathers.
>
> -L
>

How much does FEC add? I have been under the impression that it is now
mandatory for ≥100 Gbps.

>
>
> > On Feb 25, 2026, at 19:20, Lux, Jim (US 430E) wrote:
> >
> >
> > -----Original Message-----
> > From: Beowulf On Behalf Of Lawrence Stewart
> > Sent: Saturday, February 21, 2026 4:34 AM
> > To: beowulf at beowulf.org
> > Cc: Lawrence Stewart
> > Subject: [EXTERNAL] Re: [Beowulf] IB vs. Ethernet
> >
> >
> >> On Feb 21, 2026, at 3:28 AM, Greg Lindahl wrote:
> >>
> >> On Thu, Jan 15, 2026 at 08:28:36PM -0500, Lawrence Stewart wrote:
> >>
> >>> I think a 64 byte store at a core should directly become a packet.
> >>> No on-die-network, no coherence, no root complex, no host-fabric
> >>> adapter. Incoming short messages should be delivered directly to a
> >>> fifo in the relevant core.
> >>
> >> I think that's a great idea!
> >>
> >> -- greg
> >>
> >
> >
> > As Greg, I think, is hinting, this idea was a thing that QLogic HFI's
> > did, using the core write combining buffers to good effect. It seems
> > like it is also the basic idea behind MOVDIR64B, which specifies that
> > a 64 byte write will be atomic all the way down.
> >
> > Using core registers for messaging is much older, with Transputers,
> > Tilera, Dally's J Machine and arguably Cray E-registers.
> >
> > What this is really about is end to end latency. We've been stuck at
> > 1 microsecond since the Cray T3D 30 years ago, in spite of 100x
> > improvements in link speed. If we can eliminate all the middlemen and
> > get switches back to 50 ns forwarding, I think we should be able to
> > get 300 ns end to end in a good size system.
> >
> > -Larry
> >
> >
> > Indeed, I suspect the 1 microsecond probably ties to something else
> > that was convenient - If you're not running parallel wires (lanes)
> > then sending 1000 bits at 1Gbps takes 1 microsecond.
> >
> > And if the actual link gets faster, the messages get bigger, so that
> > they still take 1 microsecond.
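
The serialization arithmetic quoted just above can be checked with a
short sketch. It assumes a single serial lane and ignores line coding,
FEC, and framing overhead; the function name serialization_time_ns is
hypothetical, chosen only for illustration.

```python
def serialization_time_ns(bits: int, rate_gbps: float) -> float:
    """Time to clock `bits` onto one serial lane at `rate_gbps` (Gbit/s).

    Assumption: one lane, no line-coding, FEC, or framing overhead.
    1 Gbit/s is exactly 1 bit per nanosecond, so the result is in ns.
    """
    if rate_gbps <= 0:
        raise ValueError("rate_gbps must be positive")
    return bits / rate_gbps


if __name__ == "__main__":
    # 1000 bits at 1 Gbps, the case from the thread.
    print(serialization_time_ns(1000, 1.0))
    # A message that still takes 1 microsecond at 100 Gbps is 100,000 bits.
    print(serialization_time_ns(100_000, 100.0))
```

Both calls give 1000 ns, which is the "messages get bigger, so they
still take 1 microsecond" observation in numbers.
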
> >
> > There are some practical issues - As your symbol rate gets higher on
> > the wire, things like impedance discontinuities causing reflections
> > become more important. You have a transition from die to package,
> > one from package to board, one from board to connector/cable. And
> > those all have ~1-10 ns kind of time scales. Stack all those up and
> > it can take a long time for the cascade of reflections to die out.
> >
> > The fix, today, is to put equalizers (preferably adaptive equalizers)
> > that essentially "undistort" the waveform. But those equalizers have
> > to look at many symbol times to work (typically, they're implemented
> > as a tapped delay line with weights on each tap and summed - a FIR
> > filter), which then means that your first bit out is delayed by
> > however many symbols are in the filter's delay line. I suspect that
> > for "commodity" hardware, there's a particular length of delay line
> > that is long enough to accommodate all possible wiring
> > configurations.
> >
> > Let's look at Ethernet - the maximum ethernet run for GigE is 100
> > meters, which not so oddly, is about 500 ns long (propagation speed
> > is ~0.66c due to the dielectric and capacitance/inductance of the
> > twisted pair). So the time for a reflection to get back to the
> > sending end is, hmmm, 1 microsecond.
> >
> >
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
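
Two pieces of the quoted reasoning can be sketched numerically: the
cable propagation delay and the latency cost of a tapped-delay-line
(FIR) equalizer. This is an illustration under stated assumptions, not
a model of any real PHY. The 0.66 velocity factor is the figure from
the thread; the function names, the tap weights, and the 5-tap length
are hypothetical.

```python
C_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond


def one_way_delay_ns(length_m: float, velocity_factor: float = 0.66) -> float:
    """Propagation delay along a cable; 0.66 is the thread's twisted-pair figure."""
    return length_m / (velocity_factor * C_M_PER_NS)


def fir_equalize(symbols: list[float], taps: list[float]) -> list[float]:
    """Tapped delay line: y[n] = sum_k taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    delay_line = [0.0] * len(taps)
    out = []
    for x in symbols:
        delay_line = [x] + delay_line[:-1]  # shift the newest symbol in
        out.append(sum(w * s for w, s in zip(taps, delay_line)))
    return out


if __name__ == "__main__":
    d = one_way_delay_ns(100.0)
    print(f"100 m one way: {d:.0f} ns, reflection round trip: {2 * d:.0f} ns")

    # Hypothetical 5-tap equalizer with its main (cursor) tap in the middle,
    # leaving two taps ahead of it to cancel pre-cursor interference.
    taps = [-0.05, -0.1, 1.0, -0.1, -0.05]
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    response = fir_equalize(impulse, taps)
    main_delay = max(range(len(response)), key=lambda i: abs(response[i]))
    print(f"main tap appears {main_delay} symbol times after the input")
```

With these assumptions a 100 m run is roughly 505 ns one way, so a
reflection returns after about 1 microsecond. In the equalizer sketch
the symbol emerges at full weight only after the taps ahead of the
cursor have filled, so a longer delay line costs more symbol times
before the first usable bit.
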