[Beowulf] Home beowulf - NIC latencies
diep at xs4all.nl
Fri Feb 4 11:39:12 PST 2005
At 17:31 4-2-2005 +0000, Ashley Pittman wrote:
>On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
>> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> >> Good morning!
>> >> With the intention to run my chess program on a Beowulf to be constructed
>> >> here (starting with 2 dual-K7 machines), I had better get some good advice
>> >> on which network to buy. The only interesting thing is how fast each node
>> >> can read out 64 bytes randomly from the RAM of some remote CPU. All nodes do
>> >> this simultaneously.
>> >Is there any way to do this less often with a larger transfer?
>> >If you
>> >wrote a small benchmark that did only that (send 64 bytes randomly
>> >from a large array in memory) and make it easy to download, build, run,
>> >and report results, I suspect some people would.
>> One way pingpong with 64 bytes will do great.
>pingpong is not really the same, adding a random element can slow down
>comms and ideally it sounds like you want a one-sided operation.
>Perhaps you should look at tabletoy (Cray shmem) or gups (MPI) as a
>benchmark.
Thank you for your answer; I have indeed investigated Quadrics cards
intensively. Ask your colleague Daniel Kidger.
Shmem is an ideal solution for what searching algorithms are doing.
Regrettably, no one seems willing to sell old Quadrics cards (QM400).
>> The CPUs are 100% busy, and after I know how many requests per second the
>> network can handle in theory, I will do more probes per second to the
>> hashtable. The more probes I can do, the better for the game tree search.
>Are you overlapping comms and compute or doing blocking reads? If you
>are overlapping then the issue rate for reads is more important than the
>latency.
A node (a chess position in this case) eats on average 10 us; sometimes it's
50 us, other times 1 us. That's the time a CPU is busy calculating a
chess-technical value for how good the position is, applying human chess
patterns. This is called the evaluation function in the search world.
Before applying the evaluation function, one does a lookup into the cache to
see whether this position has already been searched. In the case of a 2-node
Beowulf, that means there is a 50% chance the position is in local memory and
a 50% chance it's a remote lookup.
The reason for this is simple to explain via the hash function, which gets
used in a lot of other software too (not only search, but also encryption,
string matching, and all types of caches).
For each piece at each square, take a random 64-bit value (a "long long").
XOR all the values with each other and you have what is called a Zobrist hash
of a position. Very effective. Nothing beats the speed of Zobrist hashing, as
you can compute it incrementally.
Now suppose we use the lower 20 bits to look up into 1 million entries. So we
AND the hash with 2^20 - 1 and look up at that address in the hashtable.
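The scheme above can be sketched in C. This is a minimal illustration, not
code from any engine: the piece/square encoding and the use of rand() for
filling the random tables are assumptions made for the sketch.

```c
#include <stdint.h>
#include <stdlib.h>

#define PIECES     12   /* 6 piece types x 2 colours -- an assumed encoding */
#define SQUARES    64
#define TABLE_BITS 20   /* lower 20 bits -> roughly 1 million entries */

static uint64_t zobrist[PIECES][SQUARES];

/* Fill the tables once at startup with pseudo-random 64-bit values.
   rand() is only for illustration; a real engine would use a better PRNG. */
static void zobrist_init(unsigned seed) {
    srand(seed);
    for (int p = 0; p < PIECES; p++)
        for (int s = 0; s < SQUARES; s++)
            zobrist[p][s] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

/* Incremental update: XOR the moving piece out of its old square and in
   on the new one. Because XOR is its own inverse, undoing a move is the
   exact same operation -- this is why Zobrist hashing is so fast. */
static uint64_t zobrist_move(uint64_t key, int piece, int from, int to) {
    return key ^ zobrist[piece][from] ^ zobrist[piece][to];
}

/* Table index: AND the key with 2^20 - 1, as described above. */
static uint32_t table_index(uint64_t key) {
    return (uint32_t)(key & ((1u << TABLE_BITS) - 1));
}
```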
Obviously such a cache is distributed across the nodes, each node holding an
equal share of the global transposition table, as it is officially called.
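Distributing the table could look like the following sketch. The modulo-based
owner/slot split is my assumption; the mail does not say how keys are mapped
to nodes. Note that with 2 nodes, half of all keys land on the remote node,
matching the 50/50 local/remote odds mentioned above.

```c
#include <stdint.h>

/* Where does a given Zobrist key live in the distributed table?
   Split the key into (owner node, local slot); each node holds an
   equal share of the global transposition table. */
typedef struct {
    unsigned node;  /* which node owns the entry */
    uint32_t slot;  /* index inside that node's local share */
} probe_target;

static probe_target locate(uint64_t key, unsigned nodes,
                           uint32_t slots_per_node) {
    probe_target t;
    t.node = (unsigned)(key % nodes);
    t.slot = (uint32_t)((key / nodes) % slots_per_node);
    return t;
}
```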
Trivially, doing this every 10 us would put too much stress on the network,
so usually one doesn't do it at the leaves themselves (the quiescence
search). That means such a lookup only gets tried in 20% of the nodes. That's
already only once every 100 us on average. The slower the network card, the
fewer remote hashtable lookups one tries, obviously. Finding for each cluster
the optimum search depth at which to try it is of course not so difficult to
figure out.
One lookup reads 64 bytes, and that's 4 entries where the position could be
stored. One entry is 16 bytes and stores quite some information. Apart from a
lot of bits to identify a chess position, the score is there (20 bits), as
well as the best move in this position.
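A 16-byte entry could be laid out as below. Only the 64-byte bucket of 4
entries, the 16-byte entry size, and the 20-bit score are from the mail;
every other field and its width is an assumed illustration.

```c
#include <stdint.h>

/* One transposition-table entry, packed into 16 bytes: bits identifying
   the chess position, a 20-bit score, and the best move.  The exact
   field split below is an assumption, not taken from the mail. */
typedef struct {
    uint64_t verify;       /* upper hash bits identifying the position */
    int32_t  score  : 20;  /* evaluation score, 20 bits as stated */
    uint32_t bound  :  2;  /* exact / lower / upper bound -- assumed */
    uint32_t age    : 10;  /* replacement age -- assumed */
    uint16_t move;         /* packed best move -- encoding assumed */
    uint16_t depth;        /* remaining search depth -- assumed */
} tt_entry;

/* One 64-byte remote read fetches a bucket of 4 such entries. */
typedef struct {
    tt_entry slot[4];
} tt_bucket;
```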
>> >> Quadrics/Dolphin seems a bit out of the price range. Myrinet is like 684 a
>> >> card when I altavista'ed online, and I wonder how to get more than 2 cards
>> >> to work without a switch. Perhaps there are low-cost switches with
>> >> low latency?
>> >Do you know that gigabit is too high latency?
>> The few one-way pingpong times I can find online for gigabit cards are not
>> exactly promising, to say it very politely. Something on the order of a 50 us
>> one-way pingpong time I don't even consider worth taking a look at.
>> Each year CPUs get faster. For small networks, 10 us really is the upper limit.
>10us is easily achievable; I've just measured a read time of a little
>over 3 us and an issue rate of 1.33 us.
Suppose an 8-node Quadrics setup, so with a switch in the middle and all
processors trying to read nonstop over the network from a random remote
processor. Each processor reads out of the 64MB on the card, so never from
the physical memory of a processor, just from the remote network cards.
What speed is achievable then to read 64 bytes?
SGI with their supercomputers never gets better than 5.8 us there on average
(that's reading 8 bytes, on the Origin3800) when the NUMAflex routers kick
in. The Altix3000 is way worse there. More bandwidth-optimized, I guess.
>> So before we start searching every node (=position), we quickly want to find
>> out whether other CPUs have already searched it.
>> On the Origin3800 with 512 processors I used a 115 GB hashtable (I started
>> the search at 460 processors), simply because the machine has 512 GB of RAM.
>> So in short, you take everything you can get.
>So is this a parallel algorithm or simply a big "memory farm" you are
>after? You don't hear much of clusters being used for the latter but in
>some cases it's an eminently sensible thing to do.
I take care that the CPUs get nearly 100% load, and say I am prepared to
sacrifice 10% of the scaling at a network to read/write latency to the
hashtable. So I just figure out how many reads I can do in 10% of the system
time and fill that with reads. The other 90% of the system time it has to
evaluate chess positions and be busy with the real stuff.
By setting the depth in the search at which a read is allowed higher or
lower, I can manually adjust the traffic over the network.
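That budgeting is plain arithmetic and can be sketched as follows. The 3 us
read time is the figure Ashley quoted above, and the 10 us per node comes
from earlier in the mail; the functions themselves are just an illustration
of the calculation.

```c
/* Probe budget: if one remote 64-byte read costs latency_us microseconds
   and we allow a fraction `budget` (e.g. 0.10) of the time to be spent
   waiting on the network, how many probes per second may a CPU issue? */
static double probes_per_second(double latency_us, double budget) {
    return budget * 1e6 / latency_us;
}

/* Equivalently: how many nodes (at node_cost_us each) must be searched
   between two probes to stay inside the budget?  This is the knob the
   mail describes as raising or lowering the probing depth. */
static double nodes_between_probes(double latency_us, double budget,
                                   double node_cost_us) {
    double probe_interval_us = latency_us / budget;
    return probe_interval_us / node_cost_us;
}
```

With a 3 us read and a 10% budget this gives about 33,000 probes per second,
i.e. one probe per 30 us, or one probe every 3 nodes at 10 us per node.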