[Beowulf] Re: dual core Opteron performance - re suse 9.3

Vincent Diepeveen diep at xs4all.nl
Tue Jul 12 09:04:40 PDT 2005

Hi Don,

A few questions.

Did you use PGO (profile guided optimizations) with gcc 3.3.4 for your code?

PGO is broken in 3.3.4 for my software: it shows when I deterministically
compare the single-cpu output of a build compiled with 3.3.4 + PGO. Did you
deterministically compare both executables with each other (running on a
single cpu) and check whether the output is 100% identical?
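
A minimal sketch of such a deterministic comparison, assuming both builds write their results to stdout and take the same arguments. The commands in the demo are stand-ins using the Python interpreter itself; the function names are hypothetical and not part of any tool mentioned in this thread.

```python
import hashlib
import subprocess
import sys


def output_digest(cmd):
    """Run cmd (a list of arguments) and return the SHA-256 of its stdout."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def outputs_identical(cmd_a, cmd_b):
    """True only if both commands produce byte-for-byte identical stdout."""
    return output_digest(cmd_a) == output_digest(cmd_b)


# Stand-in commands so the sketch runs anywhere. In practice these would be
# the plain build and the PGO build, run single-threaded on the same input.
plain_build = [sys.executable, "-c", "print(sum(range(1000)))"]
pgo_build = [sys.executable, "-c", "print(sum(range(1000)))"]

if outputs_identical(plain_build, pgo_build):
    print("outputs are identical")
else:
    print("outputs differ: suspect the compiler or the profile data")
```

Comparing digests rather than eyeballing the results catches differences in the last digits, which is where a miscompiled floating point loop tends to show up first.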

Note that the 3.3.4 gcc shipped by SuSE may not be 100% identical to stock
3.3.4 gcc. However, the same bug is present throughout the 3.3.1-3.3.x series
of the SuSE gcc.

In 4.0.0 PGO works better and produces the same output, but YMMV.

At 10:36 AM 7/12/2005 -0500, Don Kinghorn wrote:
>Hi Vincent,  ...all,
>The code was built on a SuSE9.2 machine with gcc/g77 3.3.4. The same 
>executable was run on both systems.
>Kernel for the 2 dual-node setup was SuSE stock 2.6.8-24-smp
>for the 9.3 setup with the dual-core cpus it was the stock install kernel 

>Memory was fully populated on the 2 node setup -- 4 one GB modules per
>there are only 4 slots on the Tyan 2875 (I had mistakenly reported yesterday 

I don't see any indication at Tyan that this board can take advantage of
NUMA. It looks like there is one shared memory; correct me if I'm wrong. The
RAM is not shown as belonging to one cpu, but rather to both.
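
Whether the running kernel actually sees more than one memory node can be checked from sysfs. A minimal sketch, assuming a Linux kernel that exposes `/sys/devices/system/node/nodeN/cpulist` (current kernels do; older 2.6 kernels may lack the `cpulist` file, in which case the sketch reports "unknown"). The function names are hypothetical.

```python
import os
import re


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist} for every NUMA node the kernel exposes.

    An empty dict means the directory is missing, e.g. a kernel built
    without NUMA support or a non-Linux system.
    """
    nodes = {}
    if not os.path.isdir(sysfs_root):
        return nodes
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip files such as "online" or "possible"
        cpulist_path = os.path.join(sysfs_root, entry, "cpulist")
        try:
            with open(cpulist_path) as handle:
                cpus = handle.read().strip()
        except OSError:
            cpus = "unknown"
        nodes[int(match.group(1))] = cpus
    return nodes


def describe(nodes):
    """Summarise the result of numa_nodes() in one line."""
    if len(nodes) > 1:
        return "kernel exposes %d NUMA nodes" % len(nodes)
    return "kernel exposes a single memory node (or no NUMA support)"


print(describe(numa_nodes()))
```

With node interleaving enabled in the bios, a two-socket board would typically show up as a single node here, even though each cpu has its own memory controller.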

>that there was only 2GB/per board for the benchmark numbers)
>The dual-core system had 4 one GB modules arranged 2 for each cpu.

So you compared a dual Opteron dual core (non-Tiger board)
with a dual Opteron (Tiger).

I assume you used 2 cpu's on both machines to compare the speed of your code.

Currently setting up gentoo on the quad.

>Important(?) bios settings were;
>Bank interleaving "Auto"
>Node interleaving "Auto"
>PowerNow "Disable"
>MemoryHole "Disabled" for both hardware and software settings
>The speedup we saw on the dual-core was less than 10% for the most jobs. MP2 
>jobs with heavy i/o (worst case) was around a %20 hit (there were twice as 
>many processes hitting the raid scratch space at the same time)

Are you now comparing a 4 core machine (dual Opteron dual core) with a dual
Opteron Tiger, which gave a 10% speedup for the added 2

That's an ugly speedup in that case; perhaps improve the code?
Blaming the memory controllers is not a good excuse. The 2 memory
controllers can deliver more data per second than the cpu's deliver gflop
per second.

As you can see at sudhian, diep has a speedup of 3.92 out of 4 cores.

Of course that was years of hard programming.
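
For reference, the two speedup figures in this thread can be put on the same scale. A minimal sketch: the 3.92 out of 4 figure is taken from above, while reading the quoted "less than 10%" as a 1.10x gain from doubling the core count is an assumption, since the quoted comparison is not fully spelled out. The Karp-Flatt metric is added here as one common way to estimate the serial fraction from a measured speedup.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Figure quoted above: a speedup of 3.92 on 4 cores.
diep_efficiency = parallel_efficiency(3.92, 4)
diep_serial = karp_flatt_serial_fraction(3.92, 4)

# Assumed reading of the quoted benchmark: going from 2 to 4 cores
# (a 2x increase in resources) gave roughly a 1.10x speedup.
relative_speedup = 1.10
relative_efficiency = parallel_efficiency(relative_speedup, 4 / 2)

print("3.92 on 4 cores: efficiency %.2f, serial fraction %.4f"
      % (diep_efficiency, diep_serial))
print("1.10x from doubling the cores: relative efficiency %.2f"
      % relative_efficiency)
```

The second number only describes the step from 2 to 4 cores, so it says nothing about how well the code already scaled from 1 to 2.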

>I still have lots of testing and tuning to do. These tests were just to see if
>was going to work and how much trouble it was going to be. ( It was a LOT of 
>trouble getting SuSE9.3 installed but I think worth it in the end)

Setting up gentoo 2005.0 amd64 universal here now. Will go fine.

>Best to all
>> If you 'did get better performance', that's possibly because
>> you have some kernel 2.6.x now, allowing NUMA, and a new
>> compiler version of gcc like 4.0.1 that has been bugfixed more than
>> the very buggy series 3.3.x and 3.4.x
>> Can you show us the differences between the compiler versions and kernel
>> versions you had and whether it's NUMA?
>> Also how is your memory banks configured, for 64 bits usage or 128 bits
>> single cpu usage, or are all banks filled up?

>Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
