[Beowulf] Question about amd64 architecture and floating point operations

Fri Nov 24 06:35:58 PST 2006

>>> A common confusion ... x86_64 changes nothing about the precision of
>>> floats or doubles in C or Fortran.
>> 
>> well, sort of.  it was pretty common to find at least some computations
>> in ia32 using 80b FP, intentionally or not.  but iirc in long mode
>> (colloquially x86_64), you no longer get x87 access.
> An important internal detail.  My "nothing" above was assigned to the
> program level and the computable epsilons.  Your point is that in long
> mode because you cannot use the x87 FPU there is a potential difference
> internally--no 80-bit versus possibly some--
> Oui?

I had the impression that in (pure) 64b mode, one couldn't use the legacy x87
instructions.  this doesn't seem to be the case, though - but the amd doc
(6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
for kicks, I compiled the following function using pathscale under x86_64
with and without -m32:

double foo(long double a, long double b) {
     long double c = a * b;
     return c;
}

m32:
    0:   83 c4 ec                add    $0xffffffec,%esp
    3:   db 6c 24 24             fldt   0x24(%esp)
    7:   db 6c 24 18             fldt   0x18(%esp)
    b:   de c9                   fmulp  %st,%st(1)
    d:   dd 5c 24 00             fstpl  0x0(%esp)
   11:   66 0f 12 44 24 00       movlpd 0x0(%esp),%xmm0
   17:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%esp)
   1d:   dd 44 24 08             fldl   0x8(%esp)
   21:   83 c4 14                add    $0x14,%esp
   24:   c3                      ret

x86_64:
    0:   48 83 c4 e8             add    $0xffffffffffffffe8,%rsp
    4:   db 6c 24 20             fldt   0x20(%rsp)
    8:   db 6c 24 30             fldt   0x30(%rsp)
    c:   de c9                   fmulp  %st,%st(1)
    e:   dd 5c 24 00             fstpl  0x0(%rsp)
   12:   66 0f 12 44 24 00       movlpd 0x0(%rsp),%xmm0
   18:   48 83 c4 18             add    $0x18,%rsp
   1c:   c3                      retq

you can see that 32b mode provides 12B on the stack for each 10B
extended-prec operand (fldt offsets 0x18 and 0x24), whereas 64b mode
aligns mod 16 (offsets 0x20 and 0x30).  if the source skipped the
conversion to double (i.e. returned long double), the fstpl/etc goes away
and the full precision is left on the FP stack-top.

I have to assume the AMD doc's rather cryptic comment is simply reflecting
the ABI difference, not anything like encoding or allowed instructions.

does anyone have a concise demo of using higher precision - approximating
sqrt(2) or something?  on the several linuxes I looked at, I found that
the x87 control word enabled full 80b precision (its precision-control
field can instead force automatic rounding to double or even single prec.)
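
as a starting point for such a demo, here is a minimal sketch (not from the
original thread): Newton's iteration for sqrt(2), run once in double and once
in long double, with the residual x*x - 2 evaluated in long double.  the
function names and the fixed count of 10 iterations are my own assumptions;
whether long double really carries 64 mantissa bits depends on the compiler,
the ABI and the x87 control word.

```c
#include <float.h>

/* Newton's iteration x <- (x + 2/x)/2 for sqrt(2), carried out in double. */
double sqrt2_double(void)
{
    double x = 1.0;
    for (int i = 0; i < 10; i++)      /* 10 steps is more than enough to converge */
        x = 0.5 * (x + 2.0 / x);
    return x;
}

/* The same iteration carried out in long double. */
long double sqrt2_long_double(void)
{
    long double x = 1.0L;
    for (int i = 0; i < 10; i++)
        x = 0.5L * (x + 2.0L / x);
    return x;
}

/* |x*x - 2| evaluated in long double; smaller means x is closer to sqrt(2). */
long double sqrt2_residual(long double x)
{
    long double r = x * x - 2.0L;
    return r < 0 ? -r : r;
}
```

printing sqrt2_residual(sqrt2_double()) and
sqrt2_residual(sqrt2_long_double()) with "%Lg" should show a clearly smaller
second value wherever long double is wider than double; if the two residuals
match, the build is treating long double as a plain 64b double.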

>>> This potential itself is not fully utilized as I believe only 40-bits
>>> are used (the socket F series may have bumped this up to 48-bits).
>> no, that's physical address bits, which are completely unrelated to
>> virtual address bits and/or addr register width.  consider that the last
>> generations of ia32 could address more than 4GB of ram (had more
>> than 32b of physical addressability), but any process still only ever
>> really had a 32b address space.
> More clarification. Right.  40-bits are used for physical addressing and
> an additional 8-bits are used to round out the virtual space.

bits 0-11 are offset within a (4K) page.  then 4x successive 9b chunks
index into the page-translation tree.  bit 47 (the 48th bit) is
sign-extended up through bit 63.
so it's not the full 64b, but well, is that a real/realistic problem?

> I believe socket F extends both of these numbers by 8 to 48-bit physical
> and 54-bit virtual.  I do not think we are using all 64-bits though ...
> even in socket F ... but you tend to be right very often Mark, so I am
> hesitating here.  ;-)

no, YOU'RE right that the whole 64b is not reachable (virt or phys).
but then again, it's hard to see why that matters: physical ram is 
basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256GB, 38b).
and you won't be able to mmap that 256 TB file in one go, VM-wise.
does anyone do distributed systems with pointer-swizzling any more?

>>>      Results are truncated to 64-bits when stored to memory, but a path
>> they can be; they don't have to be.
> Mmm ... I did not know this.  Compiler flags?  What are they?

just use "long double".  the C standard is probably wishy-washy about this
(permitting an implementation to use 64b), but "normal" compilers seem to 
preserve the extra bits.  compiler switches and the runtime do have some 
effect on this, though.  it looks like linux tends to default to enabling
80b (a comment in fpu_control.h claims libm requires it.)
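
one way to see what you actually get at runtime, sketched under my own
assumptions (the function names are hypothetical): halve eps until 1 + eps
rounds back to 1, forcing every intermediate through a volatile variable so
it is rounded to the declared type.  the loop count is the effective number
of mantissa bits, which reflects the control-word setting as well as the
type.

```c
#include <float.h>

/* Count how many halvings of eps it takes before 1 + eps == 1 in double. */
int mantissa_bits_double(void)
{
    volatile double one = 1.0, eps = 1.0, sum;
    int bits = 0;
    for (;;) {
        sum = one + eps;          /* volatile store rounds to double */
        if (sum == one)
            break;
        eps = eps / 2;
        bits++;
    }
    return bits;
}

/* Same measurement for long double. */
int mantissa_bits_long_double(void)
{
    volatile long double one = 1.0L, eps = 1.0L, sum;
    int bits = 0;
    for (;;) {
        sum = one + eps;          /* volatile store rounds to long double */
        if (sum == one)
            break;
        eps = eps / 2;
        bits++;
    }
    return bits;
}
```

with the x87 left at full precision I would expect 53 and 64 here; if the
control word has been set to round to double, the long double count should
drop to 53 even though the type is still stored in 80b.  (formats that are
not a single contiguous mantissa, such as double-double, will report a much
larger and not very meaningful number.)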

we have users who claim to need "quad precision" floats, and who prefer 
certain cpus/compilers because of quad support.  I'm not sure they've 
ever actually disassembled the results to see whether they're just getting
80b...
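
short of disassembling, a quick first check is to ask <float.h> how many
mantissa bits the compiler claims for long double.  this is only a sketch:
the function name and the label strings are mine, and it says nothing about
separate compiler-specific quad types or Fortran kinds, which would need
their own check.

```c
#include <float.h>

/* Classify the compiler's long double by its declared mantissa width. */
const char *long_double_kind(void)
{
    switch (LDBL_MANT_DIG) {
    case 113: return "128b IEEE quad";
    case 106: return "double-double (pair of 64b doubles)";
    case 64:  return "80b x87 extended";
    case 53:  return "plain 64b double";
    default:  return "unrecognized format";
    }
}
```

a user who reports "80b x87 extended" here is getting 64 mantissa bits from
long double, not the 113 of a true quad.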

regards, mark hahn.