[Beowulf] CCL:Question regarding Mac G5 performance
Joe Landman
landman at scalableinformatics.com
Mon May 24 15:40:45 PDT 2004
Konstantin Kudin wrote:
> --- Joe Landman <landman at scalableinformatics.com> wrote:
>
>>> It is unlikely that one can gain much speed from going to 64 bits,
>>> but the support for larger memory and unlimited scratch files is
>>> very worthwhile in itself.
>>
>> I have seen in md43, moldy, and a few others, about 20-30% under gcc
>> recompilation with -m64 on Opteron. For informatics codes it was
>> about the same.
>
> Well, bioinformatics codes presumably run mostly integer operations.
> This is very different from the heavy floating point calculations in
> G03, which are double precision on all architectures and run at
> approximately the same speed regardless of 32/64 bitness.

md43 and moldy are molecular dynamics codes. I had been thinking of
running some tests with GAMESS or some other codes, just to get a
sense of how electronic structure codes do on the system. For
computationally intensive codes, going to 64 bits on the Opteron gets you
a) double the number of general purpose registers, and
b) double the number of SSE registers.
This means that the optimizer, when it is dealing with code under heavy
register pressure, can do a better job of scheduling those resources. It
also means that some codes may be able to keep more instructions in
flight per cycle because of the added resources. The address space is
also flat, as compared to the segmented space of 32-bit mode. It is
most definitely not a simple case of 32 vs. 64 bit addressing; that
advantage is there, but it is not the only one.
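
To make the register pressure point a bit more concrete, here is a toy
kernel (not taken from md43, moldy, or any real code -- just an
illustration) that keeps several floating point accumulators live at
once; build it both ways and compare:

/* regpressure.c -- toy floating point kernel with several live
 * accumulators, purely to illustrate the register pressure point.
 *
 * 32 bit build:  gcc -O3 -m32 -o reg32 regpressure.c
 * 64 bit build:  gcc -O3 -m64 -o reg64 regpressure.c
 * (in 64 bit mode the compiler has 16 GPRs and 16 SSE registers to
 *  schedule with, instead of 8 of each)
 */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i, rep;

    if (a == NULL || b == NULL)
        return 1;

    for (i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;
        b[i] = (double)(N - i) * 0.25;
    }

    /* four partial sums keep several values live at once, which is
       where the extra registers in 64 bit mode give the scheduler
       more room to work */
    for (rep = 0; rep < 100; rep++) {
        for (i = 0; i < N; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
    }

    printf("%g\n", s0 + s1 + s2 + s3);
    free(a);
    free(b);
    return 0;
}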
One of the interesting side effects of NUMA systems has to do with
memory bandwidth per CPU as you increase the number of CPUs in a node.
For a correctly populated system (i.e. memory evenly distributed among
the CPUs), each CPU has full bandwidth to its local memory, plus an
additional latency hop to remote memory on the same node. If you stack
all of the memory on a single CPU (as I have seen many folks do before
running benchmarks and reporting their results), every CPU shares the
bandwidth of that one memory controller. In that case you get the sort
of results we see occasionally reported here. Similar results occur if
you have a kernel (say an ancient one like 2.4.18) that doesn't know
much about NUMA and the related scheduling.
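
If you want to see the local vs. remote effect for yourself, something
along these lines works (a rough sketch, assuming a NUMA-aware kernel
and libnuma installed; the buffer size is just a placeholder):

/* numa_touch.c -- rough sketch of touching local vs. remote memory on
 * a NUMA Opteron box.  Link with -lnuma.  This is an illustration,
 * not a real benchmark.
 */
#include <stdio.h>
#include <numa.h>

#define SZ (256UL * 1024 * 1024)   /* 256 MB per buffer (placeholder) */

static void touch(char *p, unsigned long n)
{
    unsigned long i;
    for (i = 0; i < n; i += 4096)  /* one write per page */
        p[i] = 1;
}

int main(void)
{
    char *local, *remote;
    int last;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return 1;
    }

    last = numa_max_node();
    numa_run_on_node(0);                  /* pin ourselves to node 0  */

    local  = numa_alloc_onnode(SZ, 0);    /* memory next to our CPU   */
    remote = numa_alloc_onnode(SZ, last); /* memory on the far node   */
    if (local == NULL || remote == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* time these two loops separately (gettimeofday, /usr/bin/time,
       whatever you like); the remote pass pays the extra hop */
    touch(local, SZ);
    touch(remote, SZ);

    numa_free(local, SZ);
    numa_free(remote, SZ);
    return 0;
}

On a single-node box both buffers land on the same node, so the gap
only shows up on a real multi-CPU NUMA system.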
>> Your mileage will vary of course, but I expect that with Gaussian and
>> others that overflow memory, the overall system design will be as
>> important (if not more so) to the overall performance as the CPU
>> architecture, unless you can somehow isolate the computation to never
>> spill to disk.
>
> With G03, some types of jobs will be mostly compute bound, and others
> will be mostly I/O bound. This is reasonably trivial to predict
> beforehand. I've tested jobs which were compute bound, because testing
> the other side of the equation is more difficult due to more factors.
>
> For I/O bound jobs, a box with loads of RAM and fast sequential I/O is
> best. Something like a dual-quad Opteron with 16-32 GB of RAM and 2-4
> ATA disks in RAID0 (striping) is a good choice these days.

:) I might suggest the 3ware folks for their controllers. Just pick
your file systems and stripe width carefully.
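
Along those lines, a crude streaming write test against the scratch
area is a quick sanity check on the RAID0 + filesystem + stripe width
combination (the path and sizes below are placeholders -- adjust for
your setup, and make the file larger than RAM so the page cache doesn't
hide the disks):

/* seqwrite.c -- crude sequential write test for a scratch area.
 * Build: gcc -O2 -o seqwrite seqwrite.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

#define BLOCK   (1024 * 1024)        /* 1 MB writes              */
#define NBLOCKS (4UL * 1024)         /* 4 GB total (placeholder) */

int main(void)
{
    const char *path = "/scratch/seqwrite.tmp";  /* placeholder path */
    char *buf = malloc(BLOCK);
    struct timeval t0, t1;
    unsigned long i;
    double secs;
    int fd;

    if (buf == NULL)
        return 1;
    memset(buf, 0xaa, BLOCK);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror(path); return 1; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < NBLOCKS; i++)
        if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
    fsync(fd);                       /* flush so we time the disks */
    gettimeofday(&t1, NULL);

    close(fd);
    unlink(path);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f MB/s sequential write\n",
           NBLOCKS * (BLOCK / 1048576.0) / secs);
    free(buf);
    return 0;
}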
Joe