[Beowulf] Home beowulf - NIC latencies

Fri Feb 4 04:35:22 PST 2005

At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> Good morning!
>> 
>> With the intention to run my chessprogram on a beowulf to be constructed
>> here (starting with 2 dual-k7 machines here) i better get some good advice
>> on which network to buy. Only interesting thing is how fast each node can
>> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>> simultaneously.
>
>Is there any way to do this less often with a larger transfer?  
>If you
>wrote a small benchmark that did only that (send 64 bytes randomly
>from a large array in memory) and make it easy to download, build, run,
>and report results, I suspect some people would.

One way pingpong with 64 bytes will do great.

Shared memory examples i have plenty, but one way pingpong approaches it
excellent. Just multiply the time with 2 and one knows the bound :)

>> The faster this can be done the better the algorithmic speedup for parallel
>> search in a chess program (property of YBW, see publications in journal of
>> icga: www.icga.org). This speedup is exponential (or better you get
>> punished exponential compared to single cpu performance).
>> 
>> Which network cards considering my small budget are having lowest latencies
>> can be used?
>
>Define small budget.  For more than 2 nodes myrinet needs a switch.
>Do you expect to be totally network latency bound?  How low is enough
>to keep the processors busy?

CPU's are 100% busy and after i know how many times a second the network
can handle in theory requests i will do more probes per second to the
hashtable. The more probes i can do the better for the game tree search.

>> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>> card when i altavista'ed online and i wonder how to get more than 2 nodes
>> to work without switch. Perhaps there is low cost switches with reasonable
>> low latency?
>Do you know that gigabit is too high latency?

The few one way pingpong times i can find online from gigabit cards are not
exactly promising, to say it very polite. Something in the order or 50 us
one way pingpong time i don't even consider worth taking a look at at the
picture.

Each years cpu's get faster. For small networks 10 us really is the upper
limit.

>Can't you send enough
>work, like say search 3 moves ahead on the head node, then for each legal
>move send that search tree to a different node?  Each node would reply with
>the highest ranked moves when done.

Let's not discuss parallel chess algorithm too much in depth. 100 different
algorithms/enhancements get combined with each other. They are not the
biggest latency problem. The latency problem is caused by the hashtable.
Hashtable is a big cache. The bigger the better. It avoids researching the
same tree again.

In games like chess and every search terrain (even simulated flight) you
can get back to the same spot by different means causing a transposition.
Like suppose you start the game with 1.e4,e5 2.d4 that leads to the same
position like 1.d4,e5 2.e4. So if we have searched already 1.e4,e5 2.d4
that position P we store into a large cache. Other cpu's first want to know
whether we already searched that position. 

Those hashtable positions get created quite quickly. Deep Blue created them
at a 100 million positions a second and simply didn't store vaste majority
in hashtable (would be hard as it was in hardware). That's one of the
reasons why it searched only 10-12 ply, already in 1999 that was no longer
spectacular when 4 processor pc's showed up at world champs. 

At a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move,
full move is when both sides make a move) in a few seconds, searching a
100000 positions per second a cpu.

So before we start searching every node (=position) we quickly want to find
out whether other cpu's already searched it.

At the origin3800 at 512 processors i used a 115 GB hashtable (i started
search at 460 processors). Simply because the machine has 512GB ram.

So in short you take everything you can get.

The search works with internal iterative deepending which means we first
search 1 ply, then 2 ply, then 3 ply and so on.

The time it takes to get to the next iteration i hereby define as the
branching factor (Knuth has a different definition as he just took into
account 1 algorithm, the 'todays' definition looks more appropriate).

In order to search 1 ply deeper obvious it's important to maintain a good
branching factor. I'm very bad in writing out mathematical proofs, but it's
obvious that the more memory we use, the more we can reduce the number of
legal moves in this position P as next few ply it might be in hashtable,
which trivially makes the time needed to search 1 ply deeper shorter.

Storing closer to the root (position where we started searching) is of
course more important than near the leafs of the search tree.

When for example not storing in hashtable last 10 ply near the leafs in an
overnight experiment the search depth dropped at 460 processors from 20 ply
to 13 ply.

Of course each processor of supercomputers is deadslow for game tree search
(it's branchy 100% integer work completely knocking down the caches), so
compared to pc's you already start at a disadvantage of a factor 16 or so
very quickly, before you start searching (in case of TERAS i had to fight
with outdated 500Mhz MIPS processors against opterons and high clocked quad
Xeons), so upgrading my own networkcards is more clever. 

Yet getting yourself a network even between a few nodes as quick as those
supercomputers is not so easy...

Additional your own beowulf network you can first decently test at before
playing at a tournament, and without good testing at the machine you play
at in tournaments you have a hard 0% chance that it plays well. 

The only thing in software that matters is testing.

>-- 
>Bill Broadley
>Computational Science and Engineering
>UC Davis