[Beowulf] Home beowulf - NIC latencies

Sat Feb 5 19:36:20 PST 2005

At 21:27 5-2-2005 -0500, Patrick Geoffray wrote:
>Hi Vincent,
>
>Vincent Diepeveen wrote:
>>>>CPU's are 100% busy and after i know how many times a second the network
>>>>can handle in theory requests i will do more probes per second to the
>>>>hashtable. The more probes i can do the better for the game tree search.
>>>
>>>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>>>it's in the 4-6us range.  If you bought dual opterons with the special
>> 
>> 
>> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
>
>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
>that includes fibers and a switch in the middle:
>
>    Length   Latency(us)    Bandwidth(MB/s)
>         0       2.684          0.000
>         1       2.874          0.336
>         2       2.898          0.690
>         4       2.978          1.343
>         8       2.965          2.699
>        16       2.993          5.347
>        32       3.409          9.388
>        64       3.563         17.960
>       128       3.977         32.185
>       256       5.699         44.916
>
>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I 
>didn't hear about noticeable SCI clusters in a long time.
>
>> I am very impressed by the quadrics and dolphin cards. Probably by
>> infinipath too when i check them out. Will do. 
>> 
>> I'm not so impressed yet by myrinet actually, but if cluster builders can
>> earn a couple of hundreds of dollars more on each node i'm sure they'll
> do it.
>
>I don't think Myrinet would be the cheapest, I am sure you can get a 
>better deal from desperate interconnect vendors.
>
>What does not impress you in Myrinet ?

Thanks for your kind answer, Patrick,

Obviously I mentioned that number because I read it elsewhere.

Well, a number of points bother me, most of which are true for the others
as well. But first let me note that I'm not against Myrinet in general. I
am just trying to solve a very specific case. For that specific case I'm
not so impressed.

Note that so far I haven't found any desperate vendor. Quadrics certainly
doesn't look desperate to me: they aren't even selling old cards anymore,
though they must still have thousands of them lying around from returned,
upgraded networks. Second-hand high-end cards seem to be very hard to find.

First of all, I'm interested in how quickly I can get 4-64 bytes from remote
memory. So not from some kind of network card cache, as Myrinet doesn't
have some megabytes on chip, but just a few tens of kilobytes. The data
therefore has to come from the remote node's main memory, at a random
address in that main memory. No streaming happens at all. The extra 400 ns
that the TLB adds is definitely not the problem, I guess.

The problem for me is to understand: "how do you get that memory on a
cluster?"

A latency on paper of course says nothing when you can't actually get the
data within that time.

"Paper supports everything."
    Arturo Ochoa (Caracas, Venezuela)

I hope everyone realizes that an important consequence of Beowulf
clusters is that you actually want to *use* all those CPUs you have
available.

So every CPU has a program running that eats 100% of the CPU time. Because
if it didn't use 100% of the CPU time, you wouldn't need a cluster!

Of that 100% CPU time you obviously must be prepared to give some away to
serve other nodes doing a read as quickly as possible.

For all the latencies I see quoted on hardware sites, it is very hard for
me to figure out whether that is a latency supported only by paper, or a
practical latency I can count on as a programmer, with all the overhead of
the software layers included, when each CPU is 100% busy running a program.

Secondly (and as I'm not a cluster expert I don't know how to avoid this),
it is of course a big LOSS in sequential speed if my program must check
every few instructions whether there is some MPI message to handle.
If I check a lot, that will slow down my program 20 times. If I don't check
a lot, other CPUs will have to wait longer, and that defeats the purpose of
a fast network card.
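
To make that trade-off concrete, here is a rough sketch of a toy cost
model. It is not a measurement of any real network or MPI library: the
cost of one "is there a message?" check (in a real program something like
an MPI_Iprobe call) and the amount of useful work done between checks are
hypothetical numbers, and the names slowdown and mean_wait_ns are made up
for this illustration.

```python
# Toy cost model for polling; all numbers are hypothetical, not measurements.

def slowdown(work_ns: float, poll_ns: float) -> float:
    """Factor by which the search slows down when every `work_ns` of
    useful work is followed by one message check costing `poll_ns`."""
    return (work_ns + poll_ns) / work_ns


def mean_wait_ns(work_ns: float, poll_ns: float) -> float:
    """Rough average time a remote request sits unnoticed, assuming it
    arrives at a uniformly random moment within one work+check cycle."""
    return (work_ns + poll_ns) / 2.0


if __name__ == "__main__":
    poll_ns = 95.0  # assumed cost of one check, purely illustrative
    for work_ns in (5.0, 95.0, 1_000.0, 10_000.0):
        print(f"work {work_ns:8.0f} ns between checks: "
              f"slowdown {slowdown(work_ns, poll_ns):6.2f}x, "
              f"mean wait {mean_wait_ns(work_ns, poll_ns) / 1000.0:5.2f} us")
```

Under these assumed numbers, checking after every 5 ns of work gives the
factor 20 slowdown, while checking only every 10 us costs about 1% but
adds several microseconds of waiting for the other side, which is more
than the card latency itself.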

Factor 20 is about the slowdown of the average 'old' supercomputer
chess programs which use MPI-type solutions: Zugzwang (Paderborn-Siemens),
P.Conners (Paderborn-Siemens), Cilkchess (MIT). I have seen those programs
with my own eyes, playing against them in world championships. It has
happened that I played on the same hardware with a similar number of CPUs
and a program having a factor 100 more chess knowledge (which slows down
the program *considerably*), and still the actual speed at which my
program searched nodes was up to a factor 5-10 higher.

Now a few years ago this was not a major problem, because for example
Cilkchess, which obviously ran a factor 20-40 slower than it could, used
1800 processors in the 1995 world championship (Hong Kong) and 512
processors in the 1999 world championship (Paderborn). Of course, because
one of those processors was really fast compared to one PC processor in
those days, in practice they were searching a lot deeper than PC programs
(and both played excellently for their day; Don Dailey especially deserves
a big compliment for that).

However, if I show up with 2 PCs and 2 network cards, then it sure matters
when I lose a lot of speed.

Obviously for embarrassingly parallel software this is no issue, but usually
for embarrassingly parallel software all you need is gigabit ethernet.

There are so many MPI applications which are not exactly embarrassingly
parallel, and of which you can see that a decent programmer would make them
run 20 times faster on a single CPU. Or to quote someone who has been doing
such rewriting work for some physics applications that run here and there:
"I didn't blink my eyes when I managed to speed up an application by a
factor 1000".

So it is very interesting for us all, and for me especially, to understand
how *fast* you can get that memory under full load of all the logical CPUs.

Third, each PC has 2 cheap K7 processors, which are a lot slower than
Opterons.

Another problem I have is that I can easily get dual K7 PCs from
chess players, and they can still be bought cheaply. For DIEP, a dual K7 is
practically the same speed as a dual Xeon 3.06GHz Northwood with all memory
slots filled with 2-2-2 DIMMs. So just compare the price of such a system
with a cheap dual K7 with registered CAS3 RAM.

Those dual K7s have 64-bit 66MHz slots, not PCI-X as far as I know, and
those who do have A64s or P4s usually don't have PCI-X on board
either. Sure, there are boards that have it, and I'm sure that if you make a
network

Dolphin say on their homepage that they can deliver 'bytes' in 3.3 us on
MPX mainboards, and somewhere claim a paper latency of 1.x us.

What is the achieved read speed to remote memory that Myrinet gets at
64 bits / 66MHz in software, that is, with 4-64 bytes ready to use by the
application?

I'm not asking for it to be accurate within 400 ns, as that's the delay
you'll have from TLB thrashing on the remote node. But accuracy within
1.5 us would be quite nice.
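
As a rough sketch of why the achieved number matters, the arithmetic below
turns a latency into remote hashtable probes per second. It assumes that a
blocking remote read costs one request latency plus one reply latency plus
whatever time the remote side needs to notice and serve it. The function
name and the service_us and outstanding parameters are hypothetical. The
Myrinet figures are taken from the MX table quoted above, which was
measured on PCI-X Opterons, so they are not an answer for 64 bits / 66MHz
slots.

```python
def probes_per_second(request_us: float, reply_us: float,
                      service_us: float = 0.0, outstanding: int = 1) -> float:
    """Remote reads per second for one CPU, assuming each read costs
    request latency + remote service time + reply latency, and that
    `outstanding` such reads can overlap perfectly (an idealization)."""
    round_trip_us = request_us + service_us + reply_us
    return outstanding * 1e6 / round_trip_us


if __name__ == "__main__":
    # gigE at roughly 40 us each way, as quoted earlier in the thread.
    print(f"gigE:    {probes_per_second(40.0, 40.0):10.0f} probes/s")
    # MX table above: 8-byte request (2.965 us), 64-byte reply (3.563 us).
    print(f"Myrinet: {probes_per_second(2.965, 3.563):10.0f} probes/s")
    # Same, if the busy remote CPU takes an assumed 5 us to notice the request.
    print(f"Myrinet: {probes_per_second(2.965, 3.563, service_us=5.0):10.0f} "
          f"probes/s with 5 us service time")
```

The third line is the interesting one: a few microseconds of delay on the
busy remote CPU can cost as much as the whole network round trip.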

First of all, for the integer-intensive applications I'm running, the
fastest processor is the Opteron, the K7 comes second and the P4 comes
third. The exception is a P4 machine equipped with the most expensive stuff
(2-2-2 RAM with all banks filled, a good mainboard, Northwoods) and
overclocked on the mainboard. However, for that price a dual Opteron can be
bought, and it just blows away that P4 big time.

With every year that new software gets released, of course that P4 gets
slower, because newer software only gets more and more complex, with more
options, and will fit less well in the P4's tiny caches, let alone when we
get a lot of 64-bit programs. They won't fit at all in those tiny slow
caches.

So until the dual-core Opterons arrive at low cost, obviously you can build
dual K7 nodes for just a few hundred dollars a node.

When adding new nodes, which in the future will no doubt be dual Opterons,
you still keep running those dual K7 nodes and obviously want to mix them
with the dual Opterons. Is that possible?

>Patrick
>-- 
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>
>