[Beowulf] [gorelsky at stanford.edu: CCL:dual-core Opteron 275 performance]

Vincent Diepeveen diep at xs4all.nl
Wed Jul 13 11:50:45 PDT 2005

Hello Mikhail,

AFAIK 2.4 kernels (except some SGI-patched ones) do not have much NUMA
support for dual Opterons.

Is the software doing this (which is how you optimize for NUMA):
  - each processor starts, allocates its own shared memory,
    puts data in THAT shared memory (allocating without writing doesn't
    make sense, as memory effectively gets allocated at the moment of
    writing), then attaches to the other processors' shared memory, so
    all 4 CPUs can write in each other's RAM. Each CPU then goes and
    calculates in its own memory. The same goes for cores. What matters
    is that core X at memory controller Y should not eat data from
    memory controller Z, as that slows things down significantly.

Is that how the software works?
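
To make the pattern above concrete, here is a rough sketch in Python. It is
not the code of Diep or of Gaussian, just an illustration under assumptions:
the worker count, the region size, and the idea that core number `rank` is a
sensible core to pin to are all made up for the example. It is also a variant
of the scheme: the parent reserves the shared regions up front (no pages are
touched yet), and each worker is the first to write to its own region, which
is the moment that matters for placement. It uses the "fork" start method,
so it assumes a Linux/Unix system.

```python
import mmap
import multiprocessing as mp
import os

N_WORKERS = 4             # assumption: one worker per core
REGION_BYTES = 64 * 1024  # assumption: small region, demo only


def worker(rank, regions, barrier, results):
    # Optional, Linux only: pin this worker to one core, if that core is
    # available. Which core sits at which memory controller is an assumption.
    try:
        if rank in os.sched_getaffinity(0):
            os.sched_setaffinity(0, {rank})
    except (AttributeError, OSError):
        pass  # no affinity support here; carry on unpinned

    # First touch: this worker is the first to write to its own region.
    own = regions[rank]
    own[:] = bytes([rank + 1]) * REGION_BYTES
    barrier.wait(timeout=30)  # every region has now been written by its owner

    # All regions are shared, so a worker can look into the others' memory.
    # Here that is limited to reading one tag byte per peer.
    peer_tags = sorted(regions[r][0] for r in range(N_WORKERS) if r != rank)

    # The heavy work stays in the worker's own region.
    local_sum = sum(own[:])
    results.put((rank, local_sum, peer_tags))


def run_demo():
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/Unix)
    # Anonymous shared mappings, inherited by the forked workers.
    regions = [mmap.mmap(-1, REGION_BYTES) for _ in range(N_WORKERS)]
    barrier = ctx.Barrier(N_WORKERS)
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, regions, barrier, results))
        for rank in range(N_WORKERS)
    ]
    for p in procs:
        p.start()
    out = sorted(results.get(timeout=60) for _ in procs)
    for p in procs:
        p.join()
    for region in regions:
        region.close()
    return out


if __name__ == "__main__":
    for rank, local_sum, peer_tags in run_demo():
        print(rank, local_sum, peer_tags)
```

Whether the pages really end up on the local memory controller depends on
the kernel's NUMA policy; the sketch only shows the order of operations
(write to your own memory first, share afterwards, compute locally).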

In that case you should be getting close to 4.0 scaling,
somewhere around 3.9x, when using kernel 2.6.x with NUMA turned on.

Diep is doing exactly the above and gets 3.93 scaling on a quad Opteron,
and 3.92 on a dual Opteron dual core (4 cores in total). See sudhian.com
for the accurate test of it and also the poor scaling of the P4 dual core.
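
For comparison, scaling figures like these can be turned into parallel
efficiency (speedup divided by number of cores). A small sketch; the function
name and the labels are mine, and the numbers are the ones quoted in this
thread (the Gaussian 03 figures are from Mikhail's message below, also on
4 cores):

```python
def parallel_efficiency(speedup, n_cores):
    """Fraction of ideal linear scaling that was actually reached."""
    return speedup / n_cores


# Speedups on 4 cores, as quoted in this thread.
speedups_on_4_cores = {
    "Diep, quad Opteron": 3.93,
    "Diep, dual Opteron dual core": 3.92,
    "G03 shared memory (Kuzminsky)": 3.6,
    "G03 Linda (Kuzminsky)": 2.95,
}

for label, speedup in speedups_on_4_cores.items():
    print(f"{label}: {parallel_efficiency(speedup, 4):.3f}")
```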

Best regards,

At 08:31 PM 7/13/2005 +0400, Mikhail Kuzminsky wrote:
>In message from Alan Louis Scheinine <scheinin at crs4.it> (Tue, 12 Jul 
>2005 12:24:27 +0200):
>>  1) Gerry Creager wrote "Hoowa!"
>>     Since the results seem useful, I would like to add the 
>>     On dual-CPU boards with Athlon32 CPUs, the program "bolam" was slow if
>>     both CPUs on the board were used, it was better to have one MPICH process
>>     per compute node.  This problem did not appear in another cluster that had
>>     Opteron dual-CPU boards (single-core), that is, two processes for each node
>>     did not cause a slowdown.  This is an indication that "bolam" is at a
>>     threshold for memory access being a bottleneck.
>The original post by S.Gorelsky (re-sent by E.Leitl) was about the good
>scalability of a 4-core/dual-CPU Opteron 275 server on the Gaussian 03
>DFT/test397 test. I'm currently testing a similar Supermicro server
>w/2*Opteron 275, but w/DDR333 instead of the DDR400 used by S.Gorelsky.
>I used SuSE 9.0 w/2.4.21 kernel.
>As I understand it, the original results of S.Gorelsky were obtained
>with shared-memory parallelization! If I use G03 w/Linda (which
>is the main parallelization tool for G03; the shared-memory
>parallelization model of G03 is available only for a more restricted
>subset of quantum-chemical methods), then the results are much worse.
>On 4 cores I obtained a speedup of only 2.95 for Linda vs 3.6 for
>shared memory. The difference is, as I understand it, simply because
>of data exchanges through RAM in the case of Linda; in the shared-memory
>model such memory traffic is absent.
>FYI: the speedup by S.Gorelsky for 4 CPUs is 3.4 (hope that I calculated
>properly :-)).
>I also obtained similar results for other quantum-chemical methods,
>which show that using Linda/G03 may give bad scalability on
>dual-core Opterons.
>We also have a quantum-chemical application (under development by us) that
>is bandwidth-limited under parallelization, and using 1 CPU (1 MPI
>process) per dual-Xeon node for Myrinet/MPICH is strongly preferred.
>In the case of (dual single core CPUs)-Opteron nodes the situation is 
>But now, for Opteron nodes with 4 cores/2 CPUs, to force the use of
>only 2 cores (out of 4), 1 on each chip, we'll need to have
>CPU affinity support in Linux.
>>     A complication for this
>>     interpretation is that the Athlon32 nodes use Linux kernel 
>>  2) Mikhail Kuzminsky asked "do you have "node interleave memory" switched off?"
>>     Reading the BIOS:
>>     Bank interleaving "Auto", there are two memory modules per CPU so there
>>        should be bank interleaving.
>>     Node interleaving "Disable"
>>  3) In an email Guy Coates asked
>>     > Did you need to use numa-tools to specify the CPU placement, or did the
>>     > kernel "do the right thing" by itself?
>>     The kernel did the right thing by itself.
>>     I have a question: what are numa-tools?
>>     On the computer I find
>>     man -k numa
>>        numa   (3)  - NUMA policy library
>>        numactl(8)  - Control NUMA policy for processes or shared 
>>     rpm -qa | grep -i numa
>>        numactl-0.6.4-1.13
>>     Is numactl the "numa-tools"?  Is there another package to consider installing?
>>     I see that numactl has many "man" pages.
>>Reference, previous message:
>> >In all cases, 4 MPI processes on a machine with 4 cores (two dual-core CPUs).
>> >Meteorology program 1, "bolam"    CPU time, real time (in seconds)
>> >      Linux kernel 2.6.9-11.ELsmp     122        128
>> >      Linux kernel            64         77
>> >
>> >Meteorology program 2, "non-hydrostatic"
>> >      Linux kernel 2.6.9-11.ELsmp     598        544
>> >      Linux kernel           430        476