[Beowulf] Q: AMD Opteron (Barcelona) 2356 vs Intel Xeon 5460

Wed Sep 17 22:35:55 PDT 2008

The scientific application used is DL_POLY 2.17.

It was built with the GNU, Pathscale and Intel compilers on the quad-core
AMD Opteron, and with the Intel compiler on the Xeon. The times below were
taken from the DL_POLY OUTPUT file; I also ran each job under the time
command. Here are the results:

                   AMD 2.3 GHz (32 GB RAM)                    Intel 2.33 GHz (32 GB RAM)
                   GNU gfortran   Pathscale     Intel 10 ifort   Intel 10 ifort

1. Serial
   OUTPUT file     147.719 sec    158.158 sec   135.729 sec      73.952 sec
   time command    2m27.791s      2m38.268s     -                1m13.972s

2. Parallel, 4 cores
   OUTPUT file     39.798 sec     44.717 sec    36.962 sec       32.317 sec
   time command    0m41.527s      0m46.571s     -                0m36.218s

3. Parallel, 8 cores
   OUTPUT file     26.880 sec     33.746 sec    27.979 sec       30.371 sec
   time command    0m30.171s  (single value reported; column not indicated)
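For anyone who wants to turn these timings into speedup and parallel-efficiency figures, here is a minimal Python sketch. The helper names (time_to_seconds, speedup, efficiency) are my own and not part of DL_POLY or of the time command; the numbers are the gfortran OUTPUT-file times from the table above.

```python
import re


def time_to_seconds(stamp: str) -> float:
    """Convert a shell `time` stamp such as '2m27.791s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedup(serial: float, parallel: float) -> float:
    """Speedup of a parallel run relative to the serial run."""
    return serial / parallel


def efficiency(serial: float, parallel: float, cores: int) -> float:
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(serial, parallel) / cores


# OUTPUT-file times (seconds) for gfortran on the Opteron, keyed by core count.
gfortran = {1: 147.719, 4: 39.798, 8: 26.880}

for cores, seconds in gfortran.items():
    s = speedup(gfortran[1], seconds)
    e = efficiency(gfortran[1], seconds, cores)
    print(f"{cores} core(s): speedup {s:.2f}, efficiency {e:.0%}")
```

The same two functions can be applied to the other columns; time_to_seconds is only needed for the values reported by the time command.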

The optimization flags used:

Intel ifort 10:    -O3  -axW  -funroll-loops  (I don't remember the exact
                   flag; it was something similar to loop unrolling)

Pathscale:         -O3  -OPT:Ofast  -ffast-math  -fno-math-errno

GNU gfortran:      -O3  -ffast-math  -funroll-all-loops  -ftree-vectorize

I'll try using GNU time for further measurements:
http://directory.fsf.org/project/time/

Thanks,
Sangamesh

On Thu, Sep 18, 2008 at 6:07 AM, Vincent Diepeveen <diep at xs4all.nl> wrote:

> How does all this change when you use a PGO optimized executable on both
> sides?
>
> Vincent
>
>
> On Sep 18, 2008, at 2:34 AM, Eric Thibodeau wrote:
>
>  Vincent Diepeveen wrote:
>>
>>> Nah,
>>>
>>> I guess he's referring to the compiler sometimes using single-precision
>>> floating point instead of double precision to get something done, and
>>> to its tendency to keep values in registers.
>>>
>>> That isn't necessarily a problem, but if I remember well, floating-point
>>> state could get wiped out when switching to SSE2.
>>>
>>> Sometimes you lose your FPU register set in that case.
>>>
>>> The main problem is that so many dangerous optimizations are possible
>>> to speed up test sets, because floating point in itself is really slow
>>> to do in hardware.
>>>
>>> Yet in general, in the last generations of Intel compilers that has
>>> improved a great deal.
>>>
>> Well, running the same code, here is the result discrepancy I got:
>> FLOPS:
>>   my code has to do: 7,975,847,125,000 (~8 Tflops) ...takes 15 minutes on
>> an 8*2-core Opteron with 32 Gigs-o-RAM (thank you OpenMP ;)
>>
>> The running times (ran it a _few_ times...but not the statistical minimum
>> of 30):
>>   ICC -> runtime == 689.249  ; summed error == 1651.78
>>   GCC -> runtime == 1134.404 ; summed error == 0.883501
>>
>> Compiler Flags:
>>   icc -xW -openmp -O3 vqOpenMP.c -o vqOpenMP
>>   gcc -lm -fopenmp -O3 -march=native vqOpenMP.c -o vqOpenMP_GCC
>>
>> No trickery, no smoke and mirrors ;) Just a _huge_ kick ASS k-Means
>> parallelized with OpenMP (thank gawd, otherwise it takes hours to run) and a
>> rather big database of 1.4 Gigs
>>
>> ... So this is what I meant by floating point errors. Yes, the runtime was
>> almost halved by ICC (and this is on an *opteron* based system, Tyan VX50).
>> The running time wasn't what I was actually looking at, though; it was the
>> precision skew, and that's where I fell off my chair.
>>
>> For the ones itching for a little more specs:
>>
>> eric@einstein ~ $ icc -V
>> Intel(R) C Compiler for applications running on Intel(R) 64, Version 10.1
>>    Build 20080602
>> Copyright (C) 1985-2008 Intel Corporation.  All rights reserved.
>> FOR NON-COMMERCIAL USE ONLY
>>
>> eric@einstein ~ $ gcc -v
>> Using built-in specs.
>> Target: x86_64-pc-linux-gnu
>> Configured with:
>> /dev/shm/portage/sys-devel/gcc-4.3.1-r1/work/gcc-4.3.1/configure
>> --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.3.1
>> --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include
>> --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1
>> --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/man
>> --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/info
>> --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include/g++-v4
>> --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
>> --enable-nls --without-included-gettext --with-system-zlib
>> --disable-checking --disable-werror --enable-secureplt --enable-multilib
>> --enable-libmudflap --disable-libssp --enable-cld --disable-libgcj
>> --enable-languages=c,c++,treelang,fortran --enable-shared
>> --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
>> --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.3.1-r1
>> p1.1'
>> Thread model: posix
>> gcc version 4.3.1 (Gentoo 4.3.1-r1 p1.1)
>>
>>>
>>> Vincent
>>>
>>> On Sep 17, 2008, at 10:25 PM, Greg Lindahl wrote:
>>>
>>>  On Wed, Sep 17, 2008 at 03:43:36PM -0400, Eric Thibodeau wrote:
>>>>
>>>>  Also, note that I've had issues with icc
>>>>> generating really fast but inaccurate code (fp model is not IEEE *by
>>>>> default*, I am sure _everyone_ knows this and I am stating the obvious
>>>>> here).
>>>>>
>>>>
>>>> All modern, high-performance compilers default that way. It's certainly
>>>> the case that sometimes it goes more horribly wrong than necessary, but
>>>> I wouldn't ding icc for this default. Compare results with IEEE mode.
>>>>
>>>> -- greg
>>>>
>>>>
>>
>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
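To illustrate the kind of precision skew discussed in the quoted thread, here is a small Python sketch. It is not Eric's k-means code and does not reproduce what icc or gcc actually emit; it only shows, under my own assumptions, that the order and the precision of floating-point additions change the result. The helpers to_float32, sum_float32 and sum_float64 are hypothetical names; single precision is emulated by rounding through the struct module after every addition.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Accumulate in single precision, rounding after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def sum_float64(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Floating-point addition is not associative, so a compiler that is allowed
# to reorder additions can legitimately change the last bits of a result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("reordering changes the sum:", left != right)

# Summed error of two accumulation strategies against a correctly rounded sum.
values = [0.1] * 1_000_000
reference = math.fsum(values)
print("double-precision error:", abs(sum_float64(values) - reference))
print("single-precision error:", abs(sum_float32(values) - reference))
```

The single-precision accumulator drifts far more than the double-precision one on the same input, which is why comparing against a run built in the compiler's strict IEEE mode, as Greg suggests, is a useful check before trusting a faster binary.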