[Beowulf] Question about amd64 architecture and floating point operations
Vincent Diepeveen
diep at xs4all.nl
Fri Nov 24 10:05:59 PST 2006
hi Mark,
Hopefully your 80-bit code is not critical to anything.
I wouldn't count on keeping the entire 64-bit mantissa.
One context switch and dang, it's gone.
I guess the question is how important it is to get a lot of digits.
Consider PFRSQRT (the 3DNow! reciprocal square root estimate), which takes
3 cycles, whereas a floating point square root takes 35 cycles.
I'd go for that SIMD route; you can then refine the estimate with a few
extra multiplies and adds and get quite a lot more bits of significance.
Perhaps even in fewer than 35 cycles.
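A minimal sketch of that refinement idea in plain C. Everything here is an assumption for illustration: the function names are made up, and the well-known bit-trick estimate merely stands in for a hardware estimate instruction such as PFRSQRT, which portable C cannot issue directly. Each Newton-Raphson step roughly doubles the number of correct bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Crude starting guess for 1/sqrt(x), good to a few percent.
 * Assumption: this bit trick is only a stand-in for a hardware
 * estimate instruction; it assumes IEEE 754 single precision. */
double rsqrt_estimate(double x)
{
    float f = (float)x;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* reinterpret the float as an integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* well-known magic constant */
    memcpy(&f, &bits, sizeof f);
    return (double)f;
}

/* Refine an estimate y of 1/sqrt(x) with `steps` Newton-Raphson
 * iterations: y <- y * (1.5 - 0.5 * x * y * y). */
double rsqrt_refine(double x, double y, int steps)
{
    assert(x > 0.0 && steps >= 0);
    for (int i = 0; i < steps; i++)
        y = y * (1.5 - 0.5 * x * y * y);
    return y;
}

/* sqrt(x) recovered as x * (1/sqrt(x)), using only multiplies and adds. */
double sqrt_via_rsqrt(double x, int steps)
{
    return x * rsqrt_refine(x, rsqrt_estimate(x), steps);
}
```

Calling sqrt_via_rsqrt(2.0, 4) should land close to double precision. Whether the refined version beats the hardware square root on a given chip is something to measure, not assume.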
Good luck,
Vincent
----- Original Message -----
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: "Richard Walsh" <rbw at ahpcrc.org>
Cc: "Beowulf Mailing List" <beowulf at beowulf.org>
Sent: Friday, November 24, 2006 3:35 PM
Subject: Re: [Beowulf] Question about amd64 architecture and floating
point operations
>>>> A common confusion ... x86_64 changes nothing about the precision of
>>>> floats or doubles in C or Fortran.
>>>
>>> well, sort of. it was pretty common to find at least some computations
>>> in ia32 using 80b FP, intentionally or not. but iirc in long mode
>>> (colloquially x86_64), you no longer get x87 access.
>> An important internal detail. My "nothing" above was assigned to the
>> program level and the computable epsilons. Your point is that in long
>> mode, because you cannot use the x87 FPU, there is a potential difference
>> internally--no 80-bit versus possibly some-- Oui?
>
> I had the impression that in (pure) 64b mode, one couldn't use the legacy
> x87 instructions. this doesn't seem to be the case, though - but the amd
> doc (6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
> for kicks, I compiled the following function using pathscale under x86_64
> with and without -m32:
>
> double foo(long double a, long double b) {
>     long double c = a * b;
>     return c;
> }
>
> m32:
>    0:  83 c4 ec              add    $0xffffffec,%esp
>    3:  db 6c 24 24           fldt   0x24(%esp)
>    7:  db 6c 24 18           fldt   0x18(%esp)
>    b:  de c9                 fmulp  %st,%st(1)
>    d:  dd 5c 24 00           fstpl  0x0(%esp)
>   11:  66 0f 12 44 24 00     movlpd 0x0(%esp),%xmm0
>   17:  f2 0f 11 44 24 08     movsd  %xmm0,0x8(%esp)
>   1d:  dd 44 24 08           fldl   0x8(%esp)
>   21:  83 c4 14              add    $0x14,%esp
>   24:  c3                    ret
>
> x86_64:
>    0:  48 83 c4 e8           add    $0xffffffffffffffe8,%rsp
>    4:  db 6c 24 20           fldt   0x20(%rsp)
>    8:  db 6c 24 30           fldt   0x30(%rsp)
>    c:  de c9                 fmulp  %st,%st(1)
>    e:  dd 5c 24 00           fstpl  0x0(%rsp)
>   12:  66 0f 12 44 24 00     movlpd 0x0(%rsp),%xmm0
>   18:  48 83 c4 18           add    $0x18,%rsp
>   1c:  c3                    retq
>
> you can see that 32b mode provides 12B in the stack frame for a 10B
> extended-prec operand, whereas 64b mode aligns mod 16. if the
> source skipped conversion to double, the fstpl/etc goes away and the
> full precision is left on the FP stack-top.
>
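A small sketch of what skipping the conversion to double buys, assuming a compiler whose long double is wider than double. foo_ld and narrowing_loses_bits are names made up for this illustration; foo is the function quoted above with the narrowing cast written out.

```c
#include <assert.h>
#include <float.h>

/* Original shape: the product is narrowed to double on return. */
double foo(long double a, long double b)
{
    long double c = a * b;
    return (double)c;
}

/* Variant that skips the narrowing and returns the full long double. */
long double foo_ld(long double a, long double b)
{
    return a * b;
}

/* Returns 1 if the narrowing in foo() discards bits that foo_ld() keeps. */
int narrowing_loses_bits(void)
{
    /* 1 + 2^-60 fits in a 64-bit significand but not in a 53-bit one. */
    long double a = 1.0L + 0x1p-60L;
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);
    return foo_ld(a, 1.0L) != (long double)foo(a, 1.0L);
}
```

Where long double is the x87 extended format, narrowing_loses_bits() should return 1; where long double is just double, it returns 0.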
> I have to assume the AMD doc's rather cryptic comment is simply reflecting
> the ABI difference, not anything like encoding or allowed instructions.
>
> does anyone have a concise demo of using higher precision - approximating
> sqrt(2) or something? I have found, on the several linuxes I looked at,
> that the x87 control word enabled full 80b precision (it can cause
> automatic rounding to double or even single prec.)
>
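One possible concise demo, offered as a sketch and not a benchmark: run the same Newton iteration for sqrt(2) once in double and once in long double, then compare how far the square of each result is from 2. The function names are made up for this example, and the gap only shows up where long double really is wider than double.

```c
#include <assert.h>
#include <stdio.h>

/* Newton's iteration x <- (x + 2/x) / 2 for sqrt(2), carried in double. */
double sqrt2_double(int iters)
{
    double x = 1.0;
    assert(iters > 0);
    for (int i = 0; i < iters; i++)
        x = 0.5 * (x + 2.0 / x);
    return x;
}

/* The same iteration carried in long double. */
long double sqrt2_long_double(int iters)
{
    long double x = 1.0L;
    assert(iters > 0);
    for (int i = 0; i < iters; i++)
        x = 0.5L * (x + 2.0L / x);
    return x;
}

/* |r*r - 2| evaluated in long double, as a measure of how good r is. */
long double sqrt2_residual(long double r)
{
    long double d = r * r - 2.0L;
    return d < 0 ? -d : d;
}

/* Print both approximations and their residuals side by side. */
void sqrt2_report(void)
{
    double d = sqrt2_double(8);
    long double l = sqrt2_long_double(8);
    printf("double:      %.20Lg  residual %.3Lg\n",
           (long double)d, sqrt2_residual((long double)d));
    printf("long double: %.20Lg  residual %.3Lg\n",
           l, sqrt2_residual(l));
}
```

Eight iterations from a starting value of 1 are more than enough for either format to converge. The iteration is written out by hand so the demo does not depend on how libm implements sqrt or sqrtl.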
>
>>>> This potential itself is not fully utilized as I believe only 40-bits
>>>> are used (the socket F series may have bumped this up to 48-bits).
>>> no, that's physical address bits, which are completely unrelated to
>>> virtual address bits and/or addr register width. consider that the last
>>> generations of ia32 could address more than 4GB of ram (had more
>>> than 32b of physical addressability), but any process still only ever
>>> really had a 32b address space.
>> More clarification. Right. 40-bits are used for physical addressing and
>> an additional 8-bits are used to round out the virtual space.
>
> bits 0-11 are offset within a page. then 4x successive 9b chunks index
> into the page-translation tree. the 48th bit is sign-extended up.
> so it's not the full 64b, but well, is that a real/realistic problem?
>
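The layout described above (a 12-bit page offset, then four 9-bit indexes, with bit 47 sign-extended upward) can be sketched as follows, assuming 4 KB pages. The struct and function names are made up for illustration; the field names follow the usual PML4/PDP/PD/PT naming of the four table levels.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 48-bit virtual address with 4 KB pages. */
struct va_fields {
    unsigned pml4;    /* bits 39-47: index into the top-level table */
    unsigned pdp;     /* bits 30-38 */
    unsigned pd;      /* bits 21-29 */
    unsigned pt;      /* bits 12-20 */
    unsigned offset;  /* bits 0-11: byte offset within the page */
};

/* An address is canonical when bits 47-63 are all equal,
 * i.e. bit 47 has been sign-extended through bit 63. */
int va_is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;            /* the 17 bits 47..63 */
    return top == 0 || top == 0x1ffff;
}

/* Split a canonical virtual address into its table indexes and offset. */
struct va_fields va_split(uint64_t va)
{
    struct va_fields f;
    assert(va_is_canonical(va));
    f.pml4   = (unsigned)((va >> 39) & 0x1ff);
    f.pdp    = (unsigned)((va >> 30) & 0x1ff);
    f.pd     = (unsigned)((va >> 21) & 0x1ff);
    f.pt     = (unsigned)((va >> 12) & 0x1ff);
    f.offset = (unsigned)(va & 0xfff);
    return f;
}
```

12 + 4*9 = 48 bits in total, which is where the 256 TB virtual address space comes from.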
>
>> I believe socket F extends both of these numbers by 8 to 48-bit physical
>> and 54-bit virtual. I do not think we are using all 64-bits though ...
>> even in socket F ... but you tend to be right very often Mark, so I am
>> hesitating here. ;-)
>
> no, YOU'RE right that the whole 64b is not reachable (virt or phys).
> but then again, it's hard to see why that matters: physical ram is
> basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256G, 38b).
> and you won't be able to mmap that 256 TB file in one go, VM-wise.
> does anyone do distributed systems with pointer-swizzling any more?
>
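The back-of-envelope numbers can be checked with a few lines of C. The helper names are hypothetical; the 8 sockets, 8 DIMMs and 4 GB per DIMM are the figures from the paragraph above. 8 * 8 * 4 GB is 256 GB, which is 2^38 bytes, and a 256 TB mapping is 2^48 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of address bits b such that 2^b >= bytes. */
int address_bits_needed(uint64_t bytes)
{
    int b = 0;
    assert(bytes > 0);
    while (b < 64 && (UINT64_C(1) << b) < bytes)
        b++;
    return b;
}

/* Capacity in bytes of a box with the given socket and DIMM counts. */
uint64_t ram_bytes(unsigned sockets, unsigned dimms_per_socket,
                   unsigned gb_per_dimm)
{
    return ((uint64_t)sockets * dimms_per_socket * gb_per_dimm) << 30;
}
```

address_bits_needed(ram_bytes(8, 8, 4)) comes out at 38, well below a 40-bit physical limit.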
>
>>>> Results are truncated to 64-bits when stored to memory, but a path
>>> they can be; they don't have to be.
>> Mmm ... I did not know this. Compiler flags? What are they?
>
> just use "long double". the C standard is probably wishy-washy about this
> (permitting an implementation to use 64b), but "normal" compilers seem to
> preserve the extra bits. compiler switches and the runtime do have some
> effect on this, though. it looks like linux tends to default to enabling
> 80b (a comment in fpu_control.h claims libm requires it.)
>
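To see what the runtime actually set, the x87 control word can be read back. A sketch assuming glibc on x86, where fpu_control.h provides _FPU_GETCW and the _FPU_SINGLE, _FPU_DOUBLE and _FPU_EXTENDED precision-control values; on anything else the function below just reports -1. The function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>   /* any glibc header also defines __GLIBC__ */

#if defined(__GLIBC__) && (defined(__x86_64__) || defined(__i386__))
#include <fpu_control.h>
#define HAVE_X87_CW 1
#else
#define HAVE_X87_CW 0
#endif

/* Significand width selected by the x87 precision-control field
 * (bits 8-9 of the control word): 24, 53 or 64 bits.
 * Returns -1 when the control word cannot be read on this platform. */
int x87_precision_bits(void)
{
#if HAVE_X87_CW
    fpu_control_t cw;
    _FPU_GETCW(cw);
    /* _FPU_EXTENDED (0x300) has both field bits set, so it also
     * serves as the mask for the precision-control field. */
    switch (cw & _FPU_EXTENDED) {
    case _FPU_EXTENDED: return 64;
    case _FPU_DOUBLE:   return 53;
    case _FPU_SINGLE:   return 24;
    default:            return -1;   /* reserved encoding */
    }
#else
    return -1;
#endif
}

/* Print the current setting; sanity-check the value first. */
void x87_report(void)
{
    int bits = x87_precision_bits();
    assert(bits == -1 || bits == 24 || bits == 53 || bits == 64);
    if (bits < 0)
        printf("x87 control word not available here\n");
    else
        printf("x87 precision control: %d-bit significand\n", bits);
}
```

This only covers x87 arithmetic. SSE scalar math on double, as in the movsd instructions in the disassembly above, does not go through the x87 control word.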
> we have users who claim to need "quad precision" floats, and who prefer
> certain cpus/compilers because of quad support. I'm not sure they've ever
> actually disassembled the results to see whether they're just getting
> 80b...
>
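Short of disassembling, a quick first check is to ask the compiler how many significand bits its long double carries. A sketch with made-up function names; it only describes the C long double type, not any separate quad type a compiler may offer under another name.

```c
#include <assert.h>
#include <float.h>

/* Map a significand width to the format it most likely indicates. */
const char *describe_long_double(int mant_bits)
{
    assert(mant_bits > 0);
    switch (mant_bits) {
    case 113: return "IEEE binary128 (true quad)";
    case 106: return "double-double pair";
    case 64:  return "x87 80-bit extended";
    case 53:  return "same as double";
    default:  return "unrecognized format";
    }
}

/* Describe the long double of the compiler building this file. */
const char *this_long_double(void)
{
    return describe_long_double(LDBL_MANT_DIG);
}
```

Printing this_long_double() on each cpu/compiler pair would show whether "quad" means 113 bits or just the 64-bit x87 significand.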
> regards, mark hahn.