[Beowulf] Question about amd64 architecture and floating point operations

Fri Nov 24 10:05:59 PST 2006

hi Mark,

Hopefully your 80-bit code is not critical to anything.
I wouldn't count on keeping the entire 64-bit mantissa.

Context switch and dang it's gone.

I guess the question is how important it is to get a lot of digits.
Consider PFRSQRT (the 3DNow! reciprocal square root estimate), which takes
3 cycles, whereas a full floating point square root takes 35 cycles.

I'd go for the SIMD route: you can then toy with the bits, add up the
results, and get quite a lot more bits of significance, perhaps even in
fewer than 35 cycles.
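
For what it's worth, a minimal sketch of that estimate-then-refine idea in plain C. The bit-trick seed is only a portable stand-in for a hardware estimate such as PFRSQRT (that substitution is an assumption of this sketch), and all function names are made up for illustration. Each Newton-Raphson step roughly doubles the number of correct bits, so a cheap low-precision estimate can reach double precision in a handful of multiply-adds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Crude reciprocal-square-root estimate.  This bit trick is only a portable
 * stand-in for a hardware estimate instruction such as PFRSQRT; it is an
 * assumption of this sketch, not what the hardware actually does. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    assert(x > 0.0f);               /* sketch covers positive, normal inputs only */
    memcpy(&bits, &x, sizeof bits); /* reinterpret the float without aliasing UB  */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y). */
static double rsqrt_refine(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

/* sqrt(x) = x * (1/sqrt(x)), using the estimate plus `steps` refinements. */
double sqrt_via_rsqrt(float x, int steps)
{
    double y = rsqrt_estimate(x);
    for (int i = 0; i < steps; ++i)
        y = rsqrt_refine(x, y);
    return x * y;
}
```

Calling sqrt_via_rsqrt(2.0f, 0) returns the raw estimate, good to a couple of percent; with four refinement steps the result should agree with sqrt(2) to about double precision. Whether this beats a hardware square root in cycles depends on the CPU and is not something this sketch measures.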

Good luck,
Vincent
----- Original Message ----- 
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: "Richard Walsh" <rbw at ahpcrc.org>
Cc: "Beowulf Mailing List" <beowulf at beowulf.org>
Sent: Friday, November 24, 2006 3:35 PM
Subject: Re: [Beowulf] Question about amd64 architecture and floating
point operations

>>>> A common confusion ... x86_64 changes nothing about the precision of
>>>> floats or doubles in C or Fortran.
>>>
>>> well, sort of.  it was pretty common to find at least some computations
>>> in ia32 using 80b FP, intentionally or not.  but iirc in long mode
>>> (colloquially x86_64), you no longer get x87 access.
>> An important internal detail.  My "nothing" above applied to the program
>> level and the computable epsilons.  Your point is that in long mode, because
>> you cannot use the x87 FPU, there is a potential difference internally: no
>> 80-bit versus possibly some.  Oui?
>
> I had the impression that in (pure) 64b mode, one couldn't use the legacy
> x87 instructions.  this doesn't seem to be the case, though - but the amd doc
> (6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
> for kicks, I compiled the following function using pathscale under x86_64
> with and without -m32:
>
> double foo(long double a, long double b) {
>     long double c = a * b;
>     return c;
> }
>
> m32:
>    0:   83 c4 ec                add    $0xffffffec,%esp
>    3:   db 6c 24 24             fldt   0x24(%esp)
>    7:   db 6c 24 18             fldt   0x18(%esp)
>    b:   de c9                   fmulp  %st,%st(1)
>    d:   dd 5c 24 00             fstpl  0x0(%esp)
>   11:   66 0f 12 44 24 00       movlpd 0x0(%esp),%xmm0
>   17:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%esp)
>   1d:   dd 44 24 08             fldl   0x8(%esp)
>   21:   83 c4 14                add    $0x14,%esp
>   24:   c3                      ret
>
> x86_64:
>    0:   48 83 c4 e8             add    $0xffffffffffffffe8,%rsp
>    4:   db 6c 24 20             fldt   0x20(%rsp)
>    8:   db 6c 24 30             fldt   0x30(%rsp)
>    c:   de c9                   fmulp  %st,%st(1)
>    e:   dd 5c 24 00             fstpl  0x0(%rsp)
>   12:   66 0f 12 44 24 00       movlpd 0x0(%rsp),%xmm0
>   18:   48 83 c4 18             add    $0x18,%rsp
>   1c:   c3                      retq
>
> you can see that 32b mode provides 12B in the stack frame for a 10B
> extended-prec operand, whereas 64b mode aligns mod 16.  if the
> source skipped conversion to double, the fstpl/etc goes away and the
> full precision is left on the FP stack-top.
>
> I have to assume the AMD doc's rather cryptic comment is simply reflecting
> the ABI difference, not anything like encoding or allowed instructions.
>
> does anyone have a concise demo of using higher precision - approximating
> sqrt(2) or something?  I have found, on the several linuxes I looked at,
> that the x87 control word enabled full 80b precision (the control word can
> also be set to force automatic rounding to double or even single prec.)
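
One possible concise demo, as a sketch rather than a definitive test: measure the machine epsilon of double and long double at run time, and run a Newton iteration for sqrt(2) entirely in long double. The function names are hypothetical. The volatile stores are there on the assumption that, without them, x87 register precision could leak into the double comparison.

```c
#include <assert.h>
#include <float.h>

/* Smallest power of two eps for which 1 + eps/2 rounds back to 1 in double.
 * The volatile store forces rounding to the declared type. */
double measured_epsilon_double(void)
{
    double eps = 1.0;
    volatile double sum = 1.0 + eps / 2.0;
    while (sum > 1.0) {
        eps /= 2.0;
        sum = 1.0 + eps / 2.0;
    }
    return eps;
}

/* Same measurement carried out in long double. */
long double measured_epsilon_long_double(void)
{
    long double eps = 1.0L;
    volatile long double sum = 1.0L + eps / 2.0L;
    while (sum > 1.0L) {
        eps /= 2.0L;
        sum = 1.0L + eps / 2.0L;
    }
    return eps;
}

/* Newton iteration x <- (x + 2/x) / 2, done entirely in long double. */
long double sqrt2_long_double(void)
{
    long double x = 1.5L;
    for (int i = 0; i < 8; ++i)
        x = 0.5L * (x + 2.0L / x);
    assert(x > 1.0L && x < 2.0L);   /* sanity check on the converged value */
    return x;
}
```

If long double really is the 80b x87 format, the long double epsilon should come out as 2^-63 against 2^-52 for double. If both come out the same, the compiler or the control word is giving you plain double.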
>
>
>>>> This potential itself is not fully utilized as I believe only 40-bits
>>>> are used (the socket F series may have bumped this up to 48-bits).
>>> no, that's physical address bits, which are completely unrelated to 
>>> virtual address bits and/or addr register width.  consider that the last 
>>> generations of ia32 could address more than 4GB of ram (had more
>>> than 32b of physical addressability), but any process still only ever 
>>> really had a 32b address space.
>> More clarification. Right.  40-bits are used for physical addressing and
>> an additional 8-bits are used to round out the virtual space.
>
> bits 0-11 are the offset within a page.  then 4x successive 9b chunks index
> into the page-translation tree.  the 48th bit (bit 47) is sign-extended up.
> so it's not the full 64b, but well, is that a real/realistic problem?
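
The layout described here (12 offset bits, four 9-bit indices, upper bits copied from bit 47) can be sketched in C as below. The struct and function names are hypothetical, and the sketch assumes 4 KB pages with 48 implemented virtual-address bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 4 KB pages, four 9-bit table indices, 48 implemented
 * virtual-address bits. */
struct va_parts {
    unsigned level[4];   /* level[3] is the top-level index (bits 39-47) */
    unsigned offset;     /* bits 0-11: byte offset within the 4 KB page  */
};

/* True if bits 48-63 are all copies of bit 47 (a "canonical" address). */
bool va_is_canonical(uint64_t va)
{
    uint64_t upper = va >> 47;            /* bit 47 plus the 16 bits above it */
    return upper == 0 || upper == 0x1ffff;
}

/* Split a canonical virtual address into page offset and table indices. */
struct va_parts va_split(uint64_t va)
{
    struct va_parts p;
    assert(va_is_canonical(va));
    p.offset = (unsigned)(va & 0xfff);
    for (int i = 0; i < 4; ++i)
        p.level[i] = (unsigned)((va >> (12 + 9 * i)) & 0x1ff);
    return p;
}
```

12 + 4 * 9 = 48, which is where the 48-bit virtual size comes from. Addresses such as 0x0000800000000000 fall in the non-canonical hole between the two halves and are rejected by va_is_canonical.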
>
>
>> I believe socket F extends both of these numbers by 8 to 48-bit physical
>> and 54-bit virtual.  I do not think we are using all 64-bits though ... even
>> in socket F ... but you tend to be right very often Mark, so I am hesitating
>> here.  ;-)
>
> no, YOU'RE right that the whole 64b is not reachable (virt or phys).
> but then again, it's hard to see why that matters: physical ram is 
> basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256G, 38b).
> and you won't be able to mmap that 256 TB file in one go, VM-wise.
> does anyone do distributed systems with pointer-swizzling any more?
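
The sizing arithmetic can be checked with a few lines of C. This is only a sketch: the helper name is made up, and it assumes "4GB" means 2^32 bytes per DIMM.

```c
#include <assert.h>
#include <stdint.h>

/* Number of address bits needed to give every byte its own address,
 * i.e. the smallest n with 2^n >= bytes. */
unsigned address_bits_for(uint64_t bytes)
{
    unsigned n = 0;
    assert(bytes > 0);
    while (n < 64 && (UINT64_C(1) << n) < bytes)
        ++n;
    return n;
}
```

With 8 sockets, 8 DIMMs each and 2^32 bytes per DIMM, address_bits_for(8 * 8 * (UINT64_C(1) << 32)) gives 38, comfortably inside a 40-bit physical limit. A 256 TB mapping needs all 48 virtual bits, which is why it cannot be mapped in one go.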
>
>
>>>>      Results are truncated to 64-bits when stored to memory, but a path
>>> they can be; they don't have to be.
>> Mmm ... I did not know this.  Compiler flags?  What are they?
>
> just use "long double".  the C standard is probably wishy-washy about this
> (permitting an implementation to use 64b), but "normal" compilers seem to 
> preserve the extra bits.  compiler switches and the runtime do have some 
> effect on this, though.  it looks like linux tends to default to enabling
> 80b (a comment in fpu_control.h claims libm requires it.)
>
> we have users who claim to need "quad precision" floats, and who prefer 
> certain cpus/compilers because of quad support.  I'm not sure they've ever 
> actually disassembled the results to see whether they're just getting
> 80b...
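
Short of disassembling, one way to see what "long double" really is on a given compiler is to look at the significand width reported by <float.h>. The sketch below uses hypothetical names and assumes a binary floating point radix. It only classifies the few widths I am sure of and reports anything else as unknown.

```c
#include <assert.h>
#include <float.h>

enum ld_format {
    LD_SAME_AS_DOUBLE,   /* 53-bit significand: long double is just double   */
    LD_X87_EXTENDED,     /* 64-bit significand: the 80b x87 format           */
    LD_DOUBLE_DOUBLE,    /* 106-bit significand: a pair of doubles           */
    LD_IEEE_QUAD,        /* 113-bit significand: true 128-bit quad precision */
    LD_UNKNOWN
};

/* Classify a format by its significand width in bits (radix 2 assumed). */
enum ld_format classify_significand(int mant_digits)
{
    assert(mant_digits > 0);
    switch (mant_digits) {
    case 53:  return LD_SAME_AS_DOUBLE;
    case 64:  return LD_X87_EXTENDED;
    case 106: return LD_DOUBLE_DOUBLE;
    case 113: return LD_IEEE_QUAD;
    default:  return LD_UNKNOWN;
    }
}

/* What this compiler and target actually provide for long double. */
enum ld_format classify_long_double(void)
{
    return classify_significand(LDBL_MANT_DIG);
}
```

Note that sizeof(long double) is not evidence of quad precision: the 12-byte and 16-byte stack slots in the disassembly above are just padding around a 10-byte x87 value.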
>
> regards, mark hahn.