[Beowulf] Woodcrest Memory bandwidth
Peter Kjellstrom
cap at nsc.liu.se
Tue Aug 15 10:21:02 PDT 2006
On Tuesday 15 August 2006 17:25, Richard Walsh wrote:
> Mark Hahn wrote:
> >>> Good point which makes perfect sense to me.
> >>> Given that the theoretical maximum is actually 21.3 GB/s
> >>> the real maximum Triad number must be 21.3/3 = 7.1 GB/s.
> >
> > I don't get this - triad does two reads and one write.
> > if you don't use store-through ('nt' versions of mov),
> > then the write also implies a read for write-allocate
> > (filling the cache line).
> > without store-through, the peak theoretical number reported by
> > stream should be 3*peak/4. the 4 is because there are 3r+1w,
> > and the 3 because stream doesn't give credit for write-allocate.
>
> That looks right. So, one socket, with write allocate, >>should<< show:
>
> 10.5 GB/sec * 0.75 = 7.875 GB/sec
>
> and two sockets 15.75 GB/sec. The problem could be related
> to competitive/ineffective use of the shared L2 cache or a bottleneck
> in the North bridge. It would seem that a look at how the performance
> grows as you add cores within versus across sockets should reveal this.
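Mark's accounting extends to all four STREAM kernels. The Python sketch below assumes the standard STREAM kernel definitions and ordinary (write-allocate) stores with no 'nt' stores. visible_fraction and stream_ceiling are hypothetical helper names, and the 10.5 GB/s per-socket peak is Richard's figure, not a measured value.

```python
from fractions import Fraction

# (reads, writes) per array element for the four STREAM kernels:
#   Copy:  c[j] = a[j]               Scale: b[j] = scalar*c[j]
#   Add:   c[j] = a[j] + b[j]        Triad: a[j] = b[j] + scalar*c[j]
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}

def visible_fraction(reads, writes, write_allocate=True):
    """Share of the real memory traffic that STREAM gives credit for."""
    counted = reads + writes
    # Assumption: without non-temporal ('nt') stores, every written cache
    # line is read in first (write-allocate); STREAM does not count that read.
    moved = counted + (writes if write_allocate else 0)
    return Fraction(counted, moved)

def stream_ceiling(peak_gbs, kernel, write_allocate=True):
    """Best rate STREAM could report for `kernel` given a raw peak of peak_gbs."""
    return peak_gbs * visible_fraction(*KERNELS[kernel], write_allocate)

for name, (r, w) in KERNELS.items():
    print(name, visible_fraction(r, w))   # Copy/Scale 2/3, Add/Triad 3/4

per_socket = stream_ceiling(10.5, "Triad")  # Richard's 10.5 GB/s per socket
print(per_socket, 2 * per_socket)           # 7.875 15.75
```

Under these assumptions Copy and Scale are credited with 2/3 of the raw traffic and Add and Triad with 3/4, which reproduces Richard's 7.875 GB/s per socket and, if the two sockets scale independently, 15.75 GB/s for the box.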
Here you go (Dell 2950 with 8 memory modules, stream compiled with icc-9.1 -O3):
[root@tbox3 streamd]# hostname ; date ; for i in 1 2 3 4 5 ; do export OMP_NUM_THREADS=$i ; ./streamd | egrep "Total memory re|Number of Th|Function |Copy:|Scale:|Add:|Triad:"; done
tbox3
Fri Aug 11 17:59:22 CEST 2006
Total memory required = 457.8 MB.
Number of Threads requested = 1
Function  Rate (MB/s)   Avg time   Min time   Max time
Copy:       3945.5494     0.0812     0.0811     0.0813
Scale:      2914.9758     0.1098     0.1098     0.1099
Add:        3227.5618     0.1488     0.1487     0.1489
Triad:      3219.5307     0.1492     0.1491     0.1493
Total memory required = 457.8 MB.
Number of Threads requested = 2
Function  Rate (MB/s)   Avg time   Min time   Max time
Copy:       4324.2058     0.0741     0.0740     0.0742
Scale:      2999.9626     0.1068     0.1067     0.1069
Add:        3309.2733     0.1451     0.1450     0.1452
Triad:      3309.7031     0.1451     0.1450     0.1452
Total memory required = 457.8 MB.
Number of Threads requested = 3
Function  Rate (MB/s)   Avg time   Min time   Max time
Copy:       5422.5441     0.0590     0.0590     0.0590
Scale:      4102.8364     0.0780     0.0780     0.0781
Add:        4487.2464     0.1070     0.1070     0.1070
Triad:      4487.7465     0.1070     0.1070     0.1070
Total memory required = 457.8 MB.
Number of Threads requested = 4
Function  Rate (MB/s)   Avg time   Min time   Max time
Copy:       6023.2969     0.0532     0.0531     0.0533
Scale:      4862.4855     0.0658     0.0658     0.0659
Add:        5264.1973     0.0912     0.0912     0.0913
Triad:      5268.1782     0.0911     0.0911     0.0911
Total memory required = 457.8 MB.
Number of Threads requested = 5
Function  Rate (MB/s)   Avg time   Min time   Max time
Copy:       5504.9004     0.0582     0.0581     0.0582
Scale:      4318.9044     0.0786     0.0741     0.1147
Add:        4705.1016     0.1042     0.1020     0.1216
Triad:      4705.2885     0.1038     0.1020     0.1184
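To put the runs next to Richard's estimate, the Triad rates can be tabulated against the 1-thread result and the 15.75 GB/s two-socket figure. A minimal sketch in Python: parse_stream and summarize are hypothetical helpers, the numbers are copied from the output above, and it assumes Richard's GB means 10^9 bytes, so 15.75 GB/s equals 15750 of STREAM's MB/s. The runs do not record which cores the threads used, so the 2- and 3-thread rows may mix same-socket and cross-socket placement.

```python
import re

# Thread-count and Triad lines copied from the five runs above.
RUNS = """\
Number of Threads requested = 1
Triad: 3219.5307 0.1492 0.1491 0.1493
Number of Threads requested = 2
Triad: 3309.7031 0.1451 0.1450 0.1452
Number of Threads requested = 3
Triad: 4487.7465 0.1070 0.1070 0.1070
Number of Threads requested = 4
Triad: 5268.1782 0.0911 0.0911 0.0911
Number of Threads requested = 5
Triad: 4705.2885 0.1038 0.1020 0.1184
"""

def parse_stream(text):
    """Map thread count -> {kernel: rate in MB/s} from filtered STREAM output."""
    results, threads = {}, None
    for line in text.splitlines():
        m = re.match(r"Number of Threads requested = (\d+)", line)
        if m:
            threads = int(m.group(1))
            results[threads] = {}
            continue
        m = re.match(r"(Copy|Scale|Add|Triad):\s+([\d.]+)", line)
        if m and threads is not None:
            results[threads][m.group(1)] = float(m.group(2))
    return results

# Assumption: Richard's 15.75 GB/s means 10**9 bytes/s, i.e. 15750 of
# STREAM's MB/s (STREAM counts 10**6 bytes per MB in its rates).
CEILING_MBS = 15750.0

def summarize(results, kernel="Triad"):
    """Rows of (threads, rate, speedup over fewest threads, share of ceiling)."""
    base = results[min(results)][kernel]
    return [(n, results[n][kernel], results[n][kernel] / base,
             results[n][kernel] / CEILING_MBS) for n in sorted(results)]

for n, rate, speedup, share in summarize(parse_stream(RUNS)):
    print(f"{n} threads: {rate:7.1f} MB/s  x{speedup:.2f}  {share:.0%} of 15.75 GB/s")
```

On these numbers the best Triad rate, at 4 threads, is about 1.64 times the 1-thread rate and roughly a third of the 15.75 GB/s ceiling.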
> Two cores on separate sockets should show higher numbers if it's
> an L2 cache issue. If they are the same as those for 2 cores on one
> socket, then you have a problem with the North bridge or with getting
> full bandwidth from the FB-DIMMs.
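Richard's test needs to know which Linux CPU numbers share a socket, and that numbering varies between machines and kernels. A sketch in Python, assuming an x86 Linux /proc/cpuinfo that lists a 'processor' and a 'physical id' line for each logical CPU. cpus_by_socket, placement_pairs and diagnose are hypothetical helpers, and the 5% tolerance for "the same" is an arbitrary assumption rather than part of Richard's suggestion.

```python
from collections import defaultdict

def cpus_by_socket(cpuinfo_text):
    """Group logical CPU numbers by the 'physical id' (socket) Linux reports."""
    sockets = defaultdict(list)
    cpu = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue  # blank separator lines between CPU entries
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            cpu = int(value)
        elif key == "physical id" and cpu is not None:
            sockets[int(value)].append(cpu)
    return dict(sockets)

def placement_pairs(sockets):
    """Pick one CPU pair on a single socket and one pair spanning two sockets."""
    first, second = sorted(sockets)[:2]
    same_socket = tuple(sockets[first][:2])
    cross_socket = (sockets[first][0], sockets[second][0])
    return same_socket, cross_socket

def diagnose(same_socket_mbs, cross_socket_mbs, tolerance=0.05):
    """Richard's heuristic for two threads.

    Clearly higher bandwidth across sockets points at the shared L2 cache;
    otherwise the suspect is the northbridge or FB-DIMM bandwidth.
    The 5% tolerance is an arbitrary assumption.
    """
    if cross_socket_mbs > same_socket_mbs * (1 + tolerance):
        return "shared L2 cache"
    return "northbridge or FB-DIMM bandwidth"
```

Pass the text of /proc/cpuinfo to cpus_by_socket, then run 2-thread STREAM once on each pair, for example OMP_NUM_THREADS=2 taskset -c 0,2 ./streamd with the CPU numbers returned by placement_pairs. Restricting the process to two CPUs normally leaves one thread on each, but it does not bind a given thread to a given CPU.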
>
> A complication in this test could be that in the one core per socket case
> the whole L2 cache is allocated to a single core. Watching performance
> change as the array sizes grow should reveal this.
>
> rbw
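For the array-size sweep Richard describes, it helps to know how STREAM's sizes and rates relate. The 457.8 MB above is consistent with three arrays of 20,000,000 8-byte doubles counted in 2^20-byte megabytes, and each rate is consistent with the bytes STREAM counts divided by the minimum time, in 10^6-byte megabytes. The Python sketch checks both and lists array lengths doubling from well inside the cache to far beyond it. It assumes 4 MB of shared L2 per Woodcrest socket and that the array length is a compile-time setting (current stream.c uses STREAM_ARRAY_SIZE; the streamd build used here may differ). total_memory_mb, rate_mbs and size_sweep are hypothetical helpers.

```python
BYTES_PER_WORD = 8              # STREAM's default double-precision arrays
L2_BYTES = 4 * 1024 * 1024      # assumption: 4 MB shared L2 per Woodcrest socket
WORDS_COUNTED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def total_memory_mb(n):
    """'Total memory required': three arrays of n words, in 2**20-byte MB."""
    return 3 * BYTES_PER_WORD * n / 1048576.0

def rate_mbs(kernel, n, min_time_s):
    """Reported rate: bytes STREAM counts / best time, in 10**6-byte MB."""
    return 1e-6 * WORDS_COUNTED[kernel] * BYTES_PER_WORD * n / min_time_s

def size_sweep(start=1 << 14, stop=1 << 25):
    """Yield (n, footprint / L2) for array lengths doubling from start to stop."""
    n = start
    while n <= stop:
        yield n, 3 * BYTES_PER_WORD * n / L2_BYTES
        n *= 2

N = 20_000_000  # inferred from "Total memory required = 457.8 MB."
print(f"{total_memory_mb(N):.1f} MB")              # 457.8 MB
print(f"{rate_mbs('Triad', N, 0.1491):.1f} MB/s")  # ~3219.3; STREAM printed 3219.5307
for n, ratio in size_sweep():
    print(f"N = {n:>9}   footprint = {ratio:7.2f} x L2")
```

The small rate mismatch comes from the printed minimum time being rounded to four decimals. Each length needs its own build and run; comparing the Triad rate across the sweep for one thread per socket and for two threads on one socket is one way to see where the shared L2 stops helping.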