[Beowulf] Performance penalty when using 128-bit reals on AMD64

Fri Jun 25 09:01:34 PDT 2010

Prentice,
As was said before, I don't believe that x64 processor architectures support
instructions that operate at 128-bit precision either. I did glance through
the official AMD manuals, and I've read the first 3 in the set for another
project, and I can't recall anything about operating on variables that
large: storing values of that precision, yes, but not multiplying them and
storing the results in registers. The results would overflow the registers,
and you'd then have to fall back on cache or on main memory. Cache could be
entirely doable, but you'd have to code in assembler to ensure that (a) the
results don't fall out of cache and (b) you are fetching the proper cache
lines to obtain your results. Main memory would once again involve coding in
assembly language.

One way I think you might be able to do this is via some of the SIMD
multimedia instructions built into the processor. I only gave that volume of
the x64 (x86_64, AMD64, tomato-vs-tomato) manuals a cursory glance, as that's
never been my concern, but I do believe that the processor architecture does
indeed support that level of precision and has the instructions to store the
rather large results in contiguous registers. Of course, I don't know what
this would do to your code.
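
To make the register-width point concrete, here is a rough Python sketch of
how a multiply on quad-precision-sized significands (113 bits) has to be
pieced together when the widest integer multiply available is 64x64 -> 128
bits, which is what the x86-64 general-purpose MUL instruction gives you.
The names split128 and mul128 are made up for illustration; this is not
taken from the AMD manuals or from any compiler's runtime library. A real
software quad-precision routine also has to handle exponents, rounding, and
special values, so its cost per operation is higher than the four
multiplies counted here. Note also that the SSE registers are 128 bits
wide, but as far as I know their floating-point instructions treat them as
packed 32-bit or 64-bit lanes rather than as one 128-bit real.

```python
# Illustrative sketch only: multiply two 128-bit unsigned integers using
# nothing wider than 64x64 -> 128-bit partial products.

MASK64 = (1 << 64) - 1


def split128(x):
    """Split a 128-bit value into (high, low) 64-bit limbs."""
    return (x >> 64) & MASK64, x & MASK64


def mul128(a, b):
    """Return (product, multiplies_used) for two 128-bit operands.

    The up-to-256-bit product is assembled from four 64x64 partial
    products, like schoolbook long multiplication on 64-bit "digits".
    """
    a_hi, a_lo = split128(a)
    b_hi, b_lo = split128(b)

    # Each line stands in for one 64x64 -> 128-bit hardware multiply.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Shift each partial product to its place value and add them up.
    product = lo_lo + ((lo_hi + hi_lo) << 64) + (hi_hi << 128)
    return product, 4


if __name__ == "__main__":
    # Both operands fit in 113 bits, the width of a quad-precision significand.
    a = (1 << 112) + 12345678901234567890
    b = (1 << 112) + 98765432109876543210
    product, multiplies = mul128(a, b)
    print(product == a * b, multiplies)  # prints: True 4
```

So one 128-bit multiply turns into four hardware multiplies plus the shifts
and carries needed to combine them, before any floating-point bookkeeping
is counted.
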
I'd suggest 4 things:
1) Order a set of the AMD64 manuals (they used to be free, not sure now)
from AMD
2) Look at a cheap, brute-force solution. I'd suggest SSD disks for swap,
perhaps. Going out to swap is the most likely cause I can think of for the
performance degradation you're seeing. It's easy and cheap to test on one
system, and if it brings the wall clock time down to something more
acceptable, then see if you can live with that.
3) Find a project that utilizes the CPU's performance counters and measure
exactly what is happening. It could be something quite simple that the
compiler is doing wrong and that you can fix w/ a few flags or a little bit
of inline assembly code (I'm no FORTRAN programmer, but whatever standard
you're using should support it if the compiler does, and most of them do).
I haven't done this in quite a while; perfctr used to be the standard.
What's the current Linux best-practice standard?
4) Start investigating other CPU/GPU solutions (if it's that important)

That's the $0.02 USD I can add to this discussion on very little sleep.
I'll mail you if further inspiration hits with more espresso. I hope it
helps. And I can't really comment on the feasibility of the GMP libraries,
as I've never used them.
Regards,
Derek R.

On Fri, Jun 25, 2010 at 9:28 AM, Prentice Bisbal <prentice at ias.edu> wrote:

> Beowulfers,
>
> One of my Fortran programmers had to increase the precision of his
> program so he switched from REAL*8 to REAL*16 which changes the size of
> his variables from 64 bits to 128 bits. The program now takes 32x longer
> to run.
>
> I'm not an expert on processor architecture, etc., but I do know that
> once the size of a variable exceeds the size of the processor's
> registers, things will slow down considerably. Is his 32x performance
> degradation in line with this?
>
> Is there any way to reduce this degradation? Would the GNU GMP library
> (or some other library) help speed things up?
>
>
> --
> Prentice