[Beowulf] Register article on Cray Cascade

Vincent Diepeveen diep at xs4all.nl
Sat Nov 10 06:15:02 PST 2012


What I'm saying, Scott, is that I wrote a benchmark, and it shows that most
quoted latencies are pretty useless for my software. The SGI box had 50 ns
routers and around 7 hops, which on paper was 480 ns 'in theory' from SGI,
so a blocked read of 960 ns in theory; on that machine my benchmark
measured 5.8 us on average.

That's what my test measures, and not with a single core while the rest of
the box sits idle, let alone with paper math done in a laboratory.
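
To give an idea of what I mean: below is a minimal sketch of such a test.
It is not my actual benchmark, just an illustration that assumes MPI
one-sided reads as the transport; the table size and read count are
arbitrary placeholders.

/* Sketch: every rank does small blocked reads from random other ranks
 * while ALL ranks hammer the network at the same time, then reports
 * the average latency per read. Illustration only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SLOTS 4096      /* 8-byte entries exposed per rank (placeholder) */
#define READS 100000    /* blocked reads per rank (placeholder)          */

int main(int argc, char **argv)
{
    int rank, size;
    long *table;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank exposes a small table, like a slice of a hashtable. */
    MPI_Win_allocate(SLOTS * sizeof(long), sizeof(long),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &table, &win);
    for (int i = 0; i < SLOTS; i++)
        table[i] = rank;

    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts hammering together */

    srand((unsigned)rank * 2654435761u);
    long val, sum = 0;
    double t0 = MPI_Wtime();
    for (int i = 0; i < READS; i++) {
        int peer = rand() % size;  /* random target rank (may be self)   */
        int slot = rand() % SLOTS; /* random entry in its table          */
        MPI_Get(&val, 1, MPI_LONG, peer, slot, 1, MPI_LONG, win);
        MPI_Win_flush(peer, win);  /* block until the read has completed */
        sum += val;
    }
    double t1 = MPI_Wtime();
    MPI_Win_unlock_all(win);

    printf("rank %d: %.2f us per blocked read (checksum %ld)\n",
           rank, (t1 - t0) * 1e6 / READS, sum);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Run it with one rank per core, so every core is polling the network at the
same time; the per-read number it prints is the latency I care about, not a
2-core ping-pong on an otherwise idle box.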

You are busy with theory that is of no practical use to software running on
that box.

I'm busy here with test software that measures and reports a number, and in
general that number covers many hops, more than one. So on a large
supercomputer, where going from core to core takes many hops on average, it
is rather hard to believe that the latency you get in practice is different
from the number the same benchmark outputs.

Now you can put a sticker on the hardware saying it's a 2 ns router or a
100 ns router; what counts is the latency the software shows me it takes
with all cores busy.

In this case Cray claims that getting from software to the first router
already costs roughly 1.2 us, and that taking another 5 hops then happens
in 0.5 us.

I use rough numbers; a few hundred nanoseconds more or less is not so
interesting in the figure above, as it seems to be a laboratory number
anyway.
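
Just to spell out what I read into those quoted figures, nothing more than
arithmetic on the numbers The Register printed (1.2 us minimum for a short
hop, 1.7 us maximum for the five-hop case):

/* Implied per-hop cost from the quoted Cray Cascade figures. */
#include <stdio.h>

int main(void)
{
    double first_hop_us = 1.2;  /* quoted minimum, short hop              */
    double five_hop_us  = 1.7;  /* quoted maximum, five-hop case          */
    int    extra_hops   = 5;    /* my reading: "another 5 hops" on top    */

    double per_hop_ns = (five_hop_us - first_hop_us) / extra_hops * 1000.0;
    printf("implied cost per extra hop: %.0f ns\n", per_hop_ns);
    /* prints: implied cost per extra hop: 100 ns */
    return 0;
}

So the claim is that the injection/first hop carries about 1.2 us and every
extra hop is priced at roughly 100 ns, and that is exactly the part I find
hard to believe once all cores are busy.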

In reality, if we compare that with other networks, we can simply see that
the first hop is a lot cheaper than 1.2 us, and that it is the remaining
hops in a big network that really slow down the blocked-read latencies.

Now it's very difficult to believe the claim that, when you already lose a
much larger overhead on the first hop, taking the rest of the hops is
suddenly just a small part of the price of that first hop.

You do not buy the machine for a theoretical test of 1 core to 1 core with
the entire box idle.

If routers and switches have simply been designed for this 1-core-to-1-core
test, and therefore cache or hash or whatever they do to optimize just that
one route, assuming the rest of the machine is idle, then that tells more
than enough of a story about what actually happens.

If you write software for these machines, you want to get the maximum out
of the machine. You don't want to get fooled, in my words, by hardware that
has been designed and optimized for an idle machine where you measure just
1 core to 1 core.

We already see how in high-frequency trading the big traders nonstop keep
their 'line busy' so as not to lose the benefit of the routers' caches,
giving them a few microseconds of advantage in trading speed over traders
who trade little, for whom the hardware therefore didn't optimize the path.
They are simply busy fooling the routers of the exchange network, with the
people running the exchange obviously having no clue why.

All this because the hardware gets designed with a single core talking to a
single core in mind, just in order to show a good one-way ping-pong on an
otherwise entirely idle box. That is a totally useless benchmark number, as
you want a number that is valid with all cores busy and working for you,
doing something useful for you.

Using 2 cores in total out of a large cluster/supercomputer/network is just
as interesting as manned missions to Mars are for now.

What I go for is the latency that you can achieve in software with all
cores busy, and we simply see that the network speed claimed by
manufacturers is off by factors from the benchmark I wrote, which nonstop
tests this blocked-read number with all cores running and all cores polling
at the same time.
I wrote a benchmark measuring that, a decade ago.

So I'm the one here who is busy with measured facts from a fully busy
supercomputer, while you're talking from a laboratory.

On Nov 10, 2012, at 1:54 PM, atchley tds.net wrote:

> Vincent,
>
> You are changing the item being tested. You disputed my statement
> that switches can have a latency as low as 100-150 ns. I described
> how to test the latency of a single hop (I neglected to say that
> the two NICs must be connected to the same switch chip, i.e. blade,
> cross-bar, etc.). You can additionally measure multi-hop links the
> same way by choosing your ports correctly.
>
> Please don't change rules because you cannot admit you are wrong.
>
> Scott
>
>
> On Fri, Nov 9, 2012 at 3:40 PM, Vincent Diepeveen <diep at xs4all.nl>  
> wrote:
> That's not how fast you can get the data at each core.
>
> The benchmark I wrote is actually a reflection of how a hashtable
> works for Game Tree Search in general.
> The speedup from it is exponential, so for doing it a different way we
> can PROVE (as in mathematical proof)
> that you will have trouble getting the same exponent (which we call
> the branching factor).
>
> So practical testing of what you can achieve from core to core is
> what matters.
>
> The first disappointment then actually happens with the new Opteron
> cores, namely that AMD has designed
> a memory controller which just doesn't scale if you use all cores.
>
> Joel Hruska performed some tests there (not sure where he posted them
> online).
> We see there that the Bulldozer-type architecture still scales OK if
> you run benchmarks on a single core.
> Sure, no really good latency, but still...
>
> Yet if you then move from using 4 processes for the measurement to 8
> processes, on a single chip we already land at nearly 200 ns, which is
> really slow.
>
> The same effect happens when you run a big supercomputer at full
> throttle with all cores.
>
> Manufacturers can claim whatever, but it is always paper math.
>
> If they ever release something, it's some sort of single-core number,
> whereas that box didn't get ordered in the first place to run
> single-core.
>
> You don't want the performance of a single core in a lab at
> temperatures near 0 Kelvin;
> you want to see that the box you got performs like this with all
> cores running :)
>
> And on the numbers posted you already start losing with Cray, starting
> with the actual CPUs, which suck when you use all cores.
>
>
> On Nov 9, 2012, at 8:38 PM, atchley tds.net wrote:
>
> Vincent, it is easy to measure.
>
> 1. Connect two NICs back-to-back.
> 2. Measure latency
> 3. Connect machines to switch
> 4. Measure latency
> 5. Subtract (2) from (4)
>
> That is how we did it at Myricom and that is how we do it at ORNL.
>
> Try it sometime.
>
> Scott
>
>
> On Fri, Nov 9, 2012 at 2:36 PM, Vincent Diepeveen <diep at xs4all.nl>  
> wrote:
>
> On Nov 9, 2012, at 7:31 PM, atchley tds.net wrote:
>
> Modern switches need 100-150 ns per hop.
>
> Yeah, that's BS when you have software that goes and measures it with
> all cores busy.
>
> I wrote a benchmark to measure that with all cores busy.
>
> The SGI box back then had 50 ns switches, which would have 'in theory'
> a latency of 480 ns at 500 CPUs,
> so 960 ns for a blocked read; I couldn't get it down to less than 5.8
> us on average.
>
>
>
>
> There are some things that do not scale per hop, such as traversing
> the PCIe link from socket to NIC and back. So, I see it as 1.2 us to
> go to the router and back, and 100 ns per hop.
>
> Scott
>
>
> On Fri, Nov 9, 2012 at 11:17 AM, Vincent Diepeveen <diep at xs4all.nl>  
> wrote:
> The latency estimate for taking 5 hops seems a tad optimistic to me,
> unless I read the English wrong and they mean 1.7 microseconds per
> hop, making a 5-hop path 5 * 1.7 = 8.5 microseconds in total.
>
> "Not every node is only one hop away, of course. On a fully
> configured system, you are five hops away maximum from any socket,
> so there is some latency. But the delta is pretty small with
> Dragonfly, with a minimum of about 1.2 microseconds for a short hop,
> an average of 1.5 microseconds on average, and a maximum of 1.7
> microseconds for the five-hop jump, according to Bolding."
>
> On Nov 8, 2012, at 7:13 PM, Hearns, John wrote:
>
> > Well worth a read:
> >
> >
> >
> > http://www.theregister.co.uk/2012/11/08/cray_cascade_xc30_supercomputer/
> >
> >
> >
> > John Hearns | CFD Hardware Specialist | McLaren Racing Limited
> >
>