[Beowulf] Shared memory
Michael Will
mwill at penguincomputing.com
Thu Jun 23 09:10:21 PDT 2005
Just yesterday I was benchmarking our A3400 quad-Opteron with dual cores
using UnixBench 4.1, which is not really an SMP benchmark except for the
8- and 16-concurrent shell script runs, and I was not too impressed with
the speed increase of those runs either, considering how much more the
CPUs cost.
Compare the A3150 (dual Opteron 248, single core) with the A3400 (quad
Opteron 875, dual core). Columns are the UnixBench final score, the
8-concurrent shell lpm, and the 16-concurrent shell lpm:

System       CPUs              RAM  OS       Score  8-conc  16-conc
A3150/raid5  dual Opteron 248   8G  FC3        668     859      443
A1300        dual Opteron 852   4G  FC3        806     964      497
A1300        dual Opteron 875   4G  RHEL3u5    724    1329      744
A3400        quad Opteron 875  32G  RHEL3u5    736    1691     1030
A3150/raid5, dual opteron 248, 8G, FC3
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux load157.load.penguincomputing.com 2.6.10-1.770_FC3smp #1 SMP Thu Feb 24 18:36:43 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
Start Benchmark Run: Wed Jun 22 16:51:09 PDT 2005
2 interactive users.
16:51:09 up 5 days, 17:38, 2 users, load average: 0.00, 0.00, 0.00
lrwxrwxrwx 1 root root 4 Jun 16 02:18 /bin/sh -> bash
/bin/sh: symbolic link to `bash'
1041184892 8447692 979848012 1% /
Dhrystone 2 using register variables 7143742.6 lps (10.0 secs, 10 samples)
Double-Precision Whetstone 1861.4 MWIPS (10.0 secs, 10 samples)
System Call Overhead 1754780.4 lps (10.0 secs, 10 samples)
Pipe Throughput 637803.6 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 99822.9 lps (10.0 secs, 10 samples)
Process Creation 8705.8 lps (30.0 secs, 3 samples)
Execl Throughput 3331.6 lps (30.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 686255.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 323633.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 221495.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 258891.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 99274.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 68148.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 1396802.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 764279.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 468049.0 KBps (30.0 secs, 3 samples)
Shell Scripts (1 concurrent) 3968.3 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 858.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 443.3 lpm (60.0 secs, 3 samples)
Arithmetic Test (type = short) 451669.6 lps (10.0 secs, 3 samples)
Arithmetic Test (type = int) 461532.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = long) 265095.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = float) 905143.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = double) 898198.1 lps (10.0 secs, 3 samples)
Arithoh 9944308.6 lps (10.0 secs, 3 samples)
C Compiler Throughput 1264.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 118607.5 lpm (30.0 secs, 3 samples)
Recursion Test--Tower of Hanoi 144138.2 lps (20.0 secs, 3 samples)
INDEX VALUES
TEST BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 7143742.6 612.1
Double-Precision Whetstone 55.0 1861.4 338.4
Execl Throughput 43.0 3331.6 774.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 221495.0 559.3
File Copy 256 bufsize 500 maxblocks 1655.0 68148.0 411.8
File Copy 4096 bufsize 8000 maxblocks 5800.0 468049.0 807.0
Pipe Throughput 12440.0 637803.6 512.7
Process Creation 126.0 8705.8 690.9
Shell Scripts (8 concurrent) 6.0 858.7 1431.2
System Call Overhead 15000.0 1754780.4 1169.9
=========
FINAL SCORE 668.0
A3400, quad Opteron 875 (4x2 cores), 32G PC2700 RAM, RHEL3ASu5
(2.4.21-32.ELsmp kernel)
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux eng223.eng.penguincomputing.com 2.4.21-32.ELsmp #1 SMP Fri Apr 15 21:03:28 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Start Benchmark Run: Wed Jun 22 16:52:16 PDT 2005
4 interactive users.
16:52:16 up 36 min, 4 users, load average: 0.50, 1.69, 3.36
lrwxrwxrwx 1 root root 4 Jun 22 14:17 /bin/sh -> bash
/bin/sh: symbolic link to bash
/dev/sda2 32834548 5628576 25538024 19% /
Dhrystone 2 using register variables 7681230.5 lps (10.0 secs, 10 samples)
Double-Precision Whetstone 1723.9 MWIPS (9.9 secs, 10 samples)
System Call Overhead 1744429.8 lps (10.0 secs, 10 samples)
Pipe Throughput 1232909.3 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 152714.6 lps (10.0 secs, 10 samples)
Process Creation 7975.6 lps (30.0 secs, 3 samples)
Execl Throughput 3265.8 lps (29.4 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 944176.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 251362.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 193918.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 510216.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 80901.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 63498.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 1517443.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 634337.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 447108.0 KBps (30.0 secs, 3 samples)
Shell Scripts (1 concurrent) 3787.2 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1690.6 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1030.2 lpm (60.0 secs, 3 samples)
Arithmetic Test (type = short) 445516.1 lps (10.0 secs, 3 samples)
Arithmetic Test (type = int) 454972.9 lps (10.0 secs, 3 samples)
Arithmetic Test (type = long) 273406.6 lps (10.0 secs, 3 samples)
Arithmetic Test (type = float) 982122.0 lps (10.0 secs, 3 samples)
Arithmetic Test (type = double) 976282.9 lps (10.0 secs, 3 samples)
Arithoh 10019944.0 lps (10.0 secs, 3 samples)
C Compiler Throughput 1326.2 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 128918.6 lpm (30.0 secs, 3 samples)
Recursion Test--Tower of Hanoi 136001.1 lps (20.0 secs, 3 samples)
INDEX VALUES
TEST BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 7681230.5 658.2
Double-Precision Whetstone 55.0 1723.9 313.4
Execl Throughput 43.0 3265.8 759.5
File Copy 1024 bufsize 2000 maxblocks 3960.0 193918.0 489.7
File Copy 256 bufsize 500 maxblocks 1655.0 63498.0 383.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 447108.0 770.9
Pipe Throughput 12440.0 1232909.3 991.1
Process Creation 126.0 7975.6 633.0
Shell Scripts (8 concurrent) 6.0 1690.6 2817.7
System Call Overhead 15000.0 1744429.8 1163.0
=========
FINAL SCORE 736.0
Now maybe the more interesting test is to see how MPI's pi benchmark
performs compared to a standard Beowulf cluster.
On a Scyld cluster with 4 dual-Opteron 242 (1.6 GHz) compute nodes, it
completed in 19 seconds:
mpirun -np 8 -no-local ./pi_MPI.g77
pi is 3.14159265Error is 3.02335934E-12
time is 18.650441 seconds
On the quad dual-core Opteron system (4x Opteron 875), roughly equivalent
to eight 248s (2.2 GHz), it takes about 8 seconds:
[root at eng223 root]# /opt/mpich/bin/mpirun -np 8 ./pi_MPI.g77
pi is 3.14159265Error is 3.02335934E-12
time is 7.45223284 seconds
Now, how to make a valid comparison between an SMP machine with a higher
clock speed and a cluster with a lower one is quite a challenge. If we
correct for the 37.5% clock-speed advantage (2.2 GHz vs 1.6 GHz) by
multiplying the faster machine's runtime by 1.375, we get about 10 seconds,
which is still almost twice as fast, since it avoids having to go through
gigabit Ethernet.
So I can see how your chess application could benefit, as long as you
don't need more than those
8 CPUs.
Michael
Vincent Diepeveen wrote:
>At 08:44 AM 6/23/2005 +0100, John Hearns wrote:
>
>
>>On Wed, 2005-06-22 at 08:46 +0100, Mark Westwood wrote:
>>
>>
>>
>>>My thinking is that mixed-mode programming, in which a code uses MPI for
>>>inter-node communication and shared-memory programming (eg OpenMP) for
>>>intra-node communication, is not worth the effort when you only have 2
>>>CPUs in each node. In fact, my thinking is that it's not even worth the
>>>time experimenting to gather evidence. I guess that makes me prejudiced
>>>against mixed-mode programming on a typical COTS cluster. Now, if you
>>>were to offer me, say, 8 or 16 CPUs per node, I might think again. Or
>>>indeed if I am shot down in flames by contributors to this list who have
>>>done the experiments ...
>>>
>>>
>>>
>>Mark, that is very well put.
>>
>>May I add that 8 or 16 CPUs per node has become a realistic possibility?
>>Four and eight way(*) Opteron machines are available, up to 128Gbytes
>>RAM.
>>
>>
>
>the 4-way quad boards are decently priced, and all you need is 16 DIMMs (of
>course you want maximum speed out of the quad and to fill it up entirely). If
>I remember well, prices are far less than 2000 euro for such a quad board.
>
>the 8-way, according to rumours, costs somewhere around 80000 euro.
>
>
>
>>Expect to see more of these machines out there as engineering
>>workstations or for visualisation and animation/rendering.
>>
>>Add dual-core CPUs to the mix and you get a very attractive platform.
>>
>>
>
>A big problem is that it's interesting to run on a quad, but please find me
>a cluster that consists of quad dual-cores.
>
>Usually clusters are slow Intel dual Xeons or something ugly from the
>past. Most organisations also need about a year between ordering and
>delivery of a cluster, and by then the processor is so outdated that it's
>about a factor of 2 slower than the latest CPU on the market.
>
>A small cluster consisting of those slow nodes just can't compete with a
>quad for algorithms that profit a lot from the fast communication that
>happens within a single mainboard.
>
>An example is the many Myrinet clusters at universities, with something
>like 64 nodes of single-CPU 3 GHz P4s.
>
>I could get such clusters for the world championships 2005, which happen
>from 13-20 August 2005.
>
>A single quad Opteron dual-core just outpowers such clusters *dramatically*
>for my chess software.
>
>This is apart from the fact that you can easily test your software on such
>a quad without problems and without other users bugging you.
>
>The difference is too huge in processing power.
>
>
>
>
>
>>(*) two quad motherboards, so I guess realistically 8-way OpenMP is
>>the limit
>>
>>
>
>
>
>
>
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
>>http://www.beowulf.org/mailman/listinfo/beowulf