[Beowulf] Nehalem memory configs
Joe Landman
landman at scalableinformatics.com
Sat Apr 11 11:43:42 PDT 2009
Since the part is released, I can report a stream test :)
richard.walsh at comcast.net wrote:
> 64 GB/sec is the right dual-socket theoretical number for this
> situation, and Intel presents the value of 33 GB/sec for the stream
> triad for the dual socket boards, so 35 GB/sec could be a copy
> perhaps, but nothing was mentioned about any benchmark in the memory
> piece. In any case, I think we have the right theoretical and
> probable real-world numbers expressed here, if people were wondering.
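For reference, the 64 GB/sec theoretical figure falls out of simple arithmetic if one assumes three DDR3-1333 channels per socket moving 8 bytes per transfer. That memory configuration is an assumption here (it is not stated in the thread), and the function name is made up for illustration:

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_s = (sockets * channels_per_socket
                   * mega_transfers_per_s * 1e6 * bytes_per_transfer)
    return bytes_per_s / 1e9

# Assumed config: 2 sockets, 3 channels each, DDR3-1333 (1333 MT/s).
print(round(peak_bandwidth_gbs(2, 3, 1333), 1))  # prints 64.0
```

Slower DIMMs or fewer populated channels lower this ceiling proportionally, so the number only applies to a fully populated board at that memory speed.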
This is a 2-socket Intel motherboard with two dual-core (not quad-core)
Nehalem E5502 1.8 GHz processors, running the OpenMP build of stream. I
bumped N way up to get a reasonable measurement.
landman@velocibunny:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 130623 microseconds.
(= 130623 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:          16545.0680     0.1942       0.1934       0.1958
Scale:         16098.2714     0.1996       0.1988       0.2019
Add:           17929.8514     0.2684       0.2677       0.2697
Triad:         17682.8117     0.2719       0.2715       0.2722
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
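As a sanity check on the table, the rates can be roughly reproduced from the array size and the printed minimum times. This is a minimal sketch, assuming the usual STREAM accounting (Copy and Scale touch two arrays per element, Add and Triad touch three, and 1 MB = 1e6 bytes for rates); the helper name is hypothetical:

```python
# Arrays read or written per element by each STREAM kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def stream_rate_mbs(kernel, n, min_time_s, word_bytes=8):
    """Rate in MB/s (1 MB = 1e6 bytes), computed from the best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * word_bytes * n
    return bytes_moved / min_time_s / 1e6

N = 200_000_000  # array size from the run
min_times = {"Copy": 0.1934, "Scale": 0.1988, "Add": 0.2677, "Triad": 0.2715}

for kernel, t in min_times.items():
    print(f"{kernel}: {stream_rate_mbs(kernel, N, t):.0f} MB/s")

# Footprint of the three arrays in MiB (compare "Total memory required").
print(f"{3 * 8 * N / 2**20:.1f} MB")
```

The results agree with the reported rates only to within a few MB/s, because the minimum times in the table are printed to four decimal places.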
And for laughs, here is the same test (same binary) run on a 2.3 GHz
Shanghai (Opteron 2376) with OMP_NUM_THREADS=4:
landman@pegasus-a3g:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 210029 microseconds.
(= 210029 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:          10885.6547     0.2943       0.2940       0.2946
Scale:         10966.1188     0.2923       0.2918       0.2929
Add:           12019.7420     0.4002       0.3993       0.4012
Triad:         12127.1875     0.3965       0.3958       0.3968
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
I suspect we have the pegasus memory in a non-optimal config. I will
look into it later next week.
Assuming we can get a pair of quad-core Nehalem units into our test
machine, it appears that 32 GB/s on stream is quite possible. Right now
it looks like ~4 GB/s per thread (about 17.7 GB/s across 4 threads).
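The per-thread estimate and the extrapolation to two quad-core parts can be sketched as below. Linear scaling from 4 to 8 threads is an assumption, not a measurement, and the function names are made up for illustration:

```python
def per_thread_gbs(total_mbs, threads):
    """Average bandwidth per thread in GB/s (1 GB = 1000 MB)."""
    return total_mbs / threads / 1000.0

def extrapolate_gbs(total_mbs, threads, target_threads):
    """Projected bandwidth, ASSUMING it scales linearly with thread count."""
    return per_thread_gbs(total_mbs, threads) * target_threads

triad_mbs = 17682.8117  # Triad rate from the Nehalem run above

print(f"{per_thread_gbs(triad_mbs, 4):.1f} GB/s per thread")
print(f"{extrapolate_gbs(triad_mbs, 4, 8):.1f} GB/s projected for 8 threads")
```

In practice the memory controllers will saturate at some point, so the projected number is an upper bound under that assumption rather than a prediction.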
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615