[Beowulf] Shared memory
Michael Will
mwill at penguincomputing.com
Thu Jun 23 09:10:21 PDT 2005
Just yesterday I was benchmarking our A3400 quad-Opteron with dual cores
using UnixBench 4.1, which is not really an SMP benchmark except for the
8- and 16-concurrent shell script runs, and I was not too impressed with
the speed increase of those runs either, considering how much more the
CPUs cost.
Compare the A3150 (dual Opteron 248, single core) with the A3400 (quad
Opteron 875, dual core). Columns are the UnixBench final score, the
8-concurrent shell lpm, and the 16-concurrent shell lpm:

System       CPUs              RAM  OS       Score  8-conc  16-conc
A3150/raid5  dual Opteron 248   8G  FC3        668     859      443
A1300        dual Opteron 852   4G  FC3        806     964      497
A1300        dual Opteron 875   4G  RHEL3u5    724    1329      744
A3400        quad Opteron 875  32G  RHEL3u5    736    1691     1030
A3150/raid5, dual opteron 248, 8G, FC3
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux load157.load.penguincomputing.com 2.6.10-1.770_FC3smp #1 SMP Thu Feb 24 18:36:43 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
Start Benchmark Run: Wed Jun 22 16:51:09 PDT 2005
2 interactive users.
16:51:09 up 5 days, 17:38, 2 users, load average: 0.00, 0.00, 0.00
lrwxrwxrwx 1 root root 4 Jun 16 02:18 /bin/sh -> bash
/bin/sh: symbolic link to `bash'
1041184892 8447692 979848012 1% /
Dhrystone 2 using register variables 7143742.6 lps (10.0 secs, 10 samples)
Double-Precision Whetstone 1861.4 MWIPS (10.0 secs, 10 samples)
System Call Overhead 1754780.4 lps (10.0 secs, 10 samples)
Pipe Throughput 637803.6 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 99822.9 lps (10.0 secs, 10 samples)
Process Creation 8705.8 lps (30.0 secs, 3 samples)
Execl Throughput 3331.6 lps (30.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 686255.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 323633.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 221495.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 258891.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 99274.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 68148.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 1396802.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 764279.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 468049.0 KBps (30.0 secs, 3 samples)
Shell Scripts (1 concurrent) 3968.3 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 858.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 443.3 lpm (60.0 secs, 3 samples)
Arithmetic Test (type = short) 451669.6 lps (10.0 secs, 3 samples)
Arithmetic Test (type = int) 461532.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = long) 265095.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = float) 905143.8 lps (10.0 secs, 3 samples)
Arithmetic Test (type = double) 898198.1 lps (10.0 secs, 3 samples)
Arithoh 9944308.6 lps (10.0 secs, 3 samples)
C Compiler Throughput 1264.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 118607.5 lpm (30.0 secs, 3 samples)
Recursion Test--Tower of Hanoi 144138.2 lps (20.0 secs, 3 samples)
INDEX VALUES
TEST BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 7143742.6 612.1
Double-Precision Whetstone 55.0 1861.4 338.4
Execl Throughput 43.0 3331.6 774.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 221495.0 559.3
File Copy 256 bufsize 500 maxblocks 1655.0 68148.0 411.8
File Copy 4096 bufsize 8000 maxblocks 5800.0 468049.0 807.0
Pipe Throughput 12440.0 637803.6 512.7
Process Creation 126.0 8705.8 690.9
Shell Scripts (8 concurrent) 6.0 858.7 1431.2
System Call Overhead 15000.0 1754780.4 1169.9
=========
FINAL SCORE 668.0
A3400, quad Opteron 875 (4x2 cores), 32G PC2700 RAM, RHEL3ASu5
(2.4.21-32.ELsmp kernel)
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux eng223.eng.penguincomputing.com 2.4.21-32.ELsmp #1 SMP Fri Apr 15 21:03:28 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Start Benchmark Run: Wed Jun 22 16:52:16 PDT 2005
4 interactive users.
16:52:16 up 36 min, 4 users, load average: 0.50, 1.69, 3.36
lrwxrwxrwx 1 root root 4 Jun 22 14:17 /bin/sh -> bash
/bin/sh: symbolic link to bash
/dev/sda2 32834548 5628576 25538024 19% /
Dhrystone 2 using register variables 7681230.5 lps (10.0 secs, 10 samples)
Double-Precision Whetstone 1723.9 MWIPS (9.9 secs, 10 samples)
System Call Overhead 1744429.8 lps (10.0 secs, 10 samples)
Pipe Throughput 1232909.3 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 152714.6 lps (10.0 secs, 10 samples)
Process Creation 7975.6 lps (30.0 secs, 3 samples)
Execl Throughput 3265.8 lps (29.4 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 944176.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 251362.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 193918.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 510216.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 80901.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 63498.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 1517443.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 634337.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 447108.0 KBps (30.0 secs, 3 samples)
Shell Scripts (1 concurrent) 3787.2 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1690.6 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1030.2 lpm (60.0 secs, 3 samples)
Arithmetic Test (type = short) 445516.1 lps (10.0 secs, 3 samples)
Arithmetic Test (type = int) 454972.9 lps (10.0 secs, 3 samples)
Arithmetic Test (type = long) 273406.6 lps (10.0 secs, 3 samples)
Arithmetic Test (type = float) 982122.0 lps (10.0 secs, 3 samples)
Arithmetic Test (type = double) 976282.9 lps (10.0 secs, 3 samples)
Arithoh 10019944.0 lps (10.0 secs, 3 samples)
C Compiler Throughput 1326.2 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 128918.6 lpm (30.0 secs, 3 samples)
Recursion Test--Tower of Hanoi 136001.1 lps (20.0 secs, 3 samples)
INDEX VALUES
TEST BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 7681230.5 658.2
Double-Precision Whetstone 55.0 1723.9 313.4
Execl Throughput 43.0 3265.8 759.5
File Copy 1024 bufsize 2000 maxblocks 3960.0 193918.0 489.7
File Copy 256 bufsize 500 maxblocks 1655.0 63498.0 383.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 447108.0 770.9
Pipe Throughput 12440.0 1232909.3 991.1
Process Creation 126.0 7975.6 633.0
Shell Scripts (8 concurrent) 6.0 1690.6 2817.7
System Call Overhead 15000.0 1744429.8 1163.0
=========
FINAL SCORE 736.0
Now maybe the more interesting test is to see how MPI's pi benchmark
performs compared to a standard Beowulf cluster.
On a Scyld cluster with 4 dual-Opteron 242 (1.6 GHz) compute nodes, it
completed in 19 seconds:
mpirun -np 8 -no-local ./pi_MPI.g77
pi is 3.14159265Error is 3.02335934E-12
time is 18.650441 seconds
On the quad dual-core Opteron system (4x Opteron 875), roughly equivalent
to eight 248s (2.2 GHz), it takes about 8 seconds:
[root at eng223 root]# /opt/mpich/bin/mpirun -np 8 ./pi_MPI.g77
pi is 3.14159265Error is 3.02335934E-12
time is 7.45223284 seconds
Now, how to make a valid comparison between an SMP machine with a higher
clock speed and a cluster with a lower one is quite a challenge. If we
correct for the 37.5% clock-speed advantage (2.2 GHz vs 1.6 GHz) by
multiplying the faster machine's runtime by 1.375, we get about 10 seconds,
which is still almost twice as fast, since it avoids having to go through
gigabit Ethernet.
So I can see how your chess application could benefit, as long as you
don't need more than those
8 CPUs.
Michael
Vincent Diepeveen wrote:
>At 08:44 AM 6/23/2005 +0100, John Hearns wrote:
>
>
>>On Wed, 2005-06-22 at 08:46 +0100, Mark Westwood wrote:
>>
>>
>>
>>>My thinking is that mixed-mode programming, in which a code uses MPI for
>>>inter-node communication and shared-memory programming (eg OpenMP) for
>>>intra-node communication, is not worth the effort when you only have 2
>>>CPUs in each node. In fact, my thinking is that it's not even worth the
>>>time experimenting to gather evidence. I guess that makes me prejudiced
>>>against mixed-mode programming on a typical COTS cluster. Now, if you
>>>were to offer me, say, 8 or 16 CPUs per node, I might think again. Or
>>>indeed if I am shot down in flames by contributors to this list who have
>>>done the experiments ...
>>>
>>>
>>>
>>Mark, that is very well put.
>>
>>May I add that 8 or 16 CPUs per node has become a realistic possibility?
>>Four and eight way(*) Opteron machines are available, up to 128Gbytes
>>RAM.
>>
>>
>
>the 4-way quad boards are decently priced, and all you need is 16 DIMMs (of
>course you want maximum speed out of the quad and to fill it up entirely). If
>I remember well, prices are far less than 2000 euro for such a quad board.
>
>the 8-way, according to rumours, costs somewhere around 80000 euro.
>
>
>
>>Expect to see more of these machines out there as engineering
>>workstations or for visualisation and animation/rendering.
>>
>>Add dual-core CPUs to the mix and you get a very attractive platform.
>>
>>
>
>A big problem is that it's interesting to run on a quad, but please find me
>a cluster that consists of quad dual-cores.
>
>Usually clusters are slow Intel dual Xeons or something ugly from the
>past. Most organisations also need about a year between ordering and
>delivery of a cluster, and by then the processor is so outdated that it's
>about a factor of 2 slower than the latest CPU on the market.
>
>A small cluster consisting of those slow nodes just can't compete with a
>quad for algorithms that profit a lot from the fast communication that
>happens within a single mainboard.
>
>An example is the many Myrinet clusters at universities, with something
>like 64 nodes of single-CPU 3 GHz P4s.
>
>I could get such clusters for the world championships 2005, which happen
>from 13-20 August 2005.
>
>A single quad Opteron dual-core just outpowers such clusters *dramatically*
>for my chess software.
>
>This is apart from the fact that you can easily test your software on such
>a quad without problems and without other users bugging you.
>
>The difference is too huge in processing power.
>
>
>
>
>
>>(*) two quad motherboards, so I guess realistically 8-way OpenMP is
>>the limit
>>
>>
>
>
>
>
>
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
>>http://www.beowulf.org/mailman/listinfo/beowulf