[Beowulf] Nehalem memory configs

Mon Apr 13 12:02:28 PDT 2009

> On Behalf Of Joe Landman
> 
> Since the part is released, I can report a stream test :)

And so can I :-)    (below)

> 
> richard.walsh at comcast.net wrote:
> 
> > 64 GB/sec is the right dual-socket theoretical number for this
> > situation, and Intel presents the value of 33 GB/sec for the stream
> > triad for the dual socket boards, so 35 GB/sec could be a copy perhaps,
> > but nothing was mentioned about any benchmark in the memory piece.

The STREAM benchmark was mentioned in the delltechcenter piece, but not which sub-benchmark (Copy, Triad, etc.).
Here are some results we got on a Nehalem system with dual
Intel Xeon W5580 @ 3.20GHz CPUs,
6x 2GB DDR3-1333 DIMMs (one per memory channel),
and SMT turned off.
All 4 STREAM components are over 37 GB/s when run on 8 threads across the two CPUs:
------------------

OpenMP (8 threads)
Intel 11.0, icc -O3 -openmp -static
Array size = 32000000, Offset = 0
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       38705.2547       0.0134       0.0132       0.0135
Scale:      37735.3959       0.0137       0.0136       0.0138
Add:        37293.9249       0.0207       0.0206       0.0209
Triad:      37388.7235       0.0207       0.0205       0.0209

Serial
Intel 11.0,  icc -O3 -static
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10781.6770       0.0475       0.0475       0.0475
Scale:      10080.7104       0.0508       0.0508       0.0508
Add:        12646.7882       0.0608       0.0607       0.0608
Triad:      12628.8395       0.0608       0.0608       0.0608

-------------------
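
For anyone cross-checking the tables: STREAM computes each rate from the best (minimum) time and the bytes the kernel moves per pass, counting 2 arrays for Copy and Scale, 3 for Add and Triad, at 8 bytes per double. A minimal Python sketch of that arithmetic follows; the function and variable names are my own, not from STREAM itself. It should land within a fraction of a percent of the 8-thread rates above, the gap coming from the printed times being rounded to four decimals.

```python
# Arrays read or written per element, as STREAM counts them for each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
BYTES_PER_DOUBLE = 8


def stream_rate_mb_s(kernel, array_size, min_time_s):
    """Rate in MB/s (1 MB = 1e6 bytes) from array size and best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * array_size
    return 1.0e-6 * bytes_moved / min_time_s


# "Min time" column of the 8-thread OpenMP run (Array size = 32000000).
openmp_min_times = {"Copy": 0.0132, "Scale": 0.0136, "Add": 0.0206, "Triad": 0.0205}

for kernel, min_time in openmp_min_times.items():
    rate = stream_rate_mb_s(kernel, 32_000_000, min_time)
    print(f"{kernel:6s} {rate:10.1f} MB/s")
```

The same function can be pointed at the 200000000-element runs quoted further down by changing the array size and times.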

The 3.2 GHz W5580 part is for workstations.  We'll remeasure when we get some servers with somewhat slower CPUs, but I would not expect a big difference from the above.
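
To put the measured numbers next to the 64 GB/sec theoretical figure quoted earlier, here is a rough Python sketch of the peak-bandwidth arithmetic. It assumes each DDR3 channel is 64 bits (8 bytes) wide and runs at the nominal 1333 MT/s, with 3 channels per socket as in the 6-DIMM configuration above; the names are made up for illustration.

```python
# Assumptions: 8 bytes per transfer per channel (64-bit DDR3 channel),
# nominal DDR3-1333 rate, 3 memory channels per socket, 2 sockets.
TRANSFERS_PER_SEC_M = 1333      # mega-transfers per second
BYTES_PER_TRANSFER = 8
CHANNELS_PER_SOCKET = 3
SOCKETS = 2


def peak_bandwidth_mb_s(mt_per_s, channels, sockets, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in MB/s (1 MB = 1e6 bytes)."""
    return mt_per_s * bytes_per_transfer * channels * sockets


peak = peak_bandwidth_mb_s(TRANSFERS_PER_SEC_M, CHANNELS_PER_SOCKET, SOCKETS,
                           BYTES_PER_TRANSFER)

# Measured 8-thread Triad rate from the table above, in MB/s.
measured_triad = 37388.7235
fraction = measured_triad / peak

print(f"theoretical peak: {peak / 1000:.1f} GB/s")
print(f"Triad achieves:   {fraction:.0%} of peak")
```

By this estimate the 8-thread Triad result is a bit under 60% of the theoretical dual-socket peak.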

-Tom Elken

> > In any case, I think we have the right theoretical and probable
> > real-world numbers expressed here, if people were wondering.
> 
> 2-socket Intel MB with 2 dual core (not quad core) Nehalem E5502 1.8 GHz
> processors, running stream omp (I bumped N way up to get a reasonable
> measurement).
> 
> landman at velocibunny:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 130623 microseconds.
>     (= 130623 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       16545.0680       0.1942       0.1934       0.1958
> Scale:      16098.2714       0.1996       0.1988       0.2019
> Add:        17929.8514       0.2684       0.2677       0.2697
> Triad:      17682.8117       0.2719       0.2715       0.2722
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
> 
> and for laughs, same test run (with same binary) on Shanghai 2.3 GHz
> (2376) with OMP_NUM_THREADS=4
> 
> 
> landman at pegasus-a3g:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 210029 microseconds.
>     (= 210029 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       10885.6547       0.2943       0.2940       0.2946
> Scale:      10966.1188       0.2923       0.2918       0.2929
> Add:        12019.7420       0.4002       0.3993       0.4012
> Triad:      12127.1875       0.3965       0.3958       0.3968
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
> 
> I suspect we have the pegasus memory in a non-optimal config, will look
> later on next week.
> 
> Assuming we can get a pair of quad core Nehalem units into our test
> machine, it appears that 32 GB/s on stream is quite possible.  Right now
> it looks like ~4 GB/s per thread.
> 
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>         http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf