[Beowulf] AMD64 results...

Thu Dec 16 08:17:06 PST 2004

Robert G. Brown wrote:
> [...]   One can see how having 64 bits would really
> speed up 64 bit division compared to doing it in software across
> multiple 32 bit registers...

Correct me if I'm wrong, but doesn't the floating point unit normally 
use an internal iterative process to perform the division?  This would 
not involve 32-bit registers...

I'm not so sure about *integer* 64-bit division.  Integer division may 
involve multiple 32-bit integer registers.
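
For what it's worth, here is a rough sketch of how a 64-bit integer could be divided using only 32-bit pieces. It assumes the easy case of a divisor that fits in 32 bits. Python integers stand in for registers, and the name div64_by_32 is made up for illustration. The second step divides a two-word value by one word, which is what the x86 DIV instruction does with EDX:EAX. A full 64-bit divisor needs more work than this.

```python
MASK32 = 0xFFFFFFFF

def div64_by_32(hi, lo, d):
    """Divide the 64-bit value (hi << 32 | lo) by a 32-bit divisor d.

    hi and lo are the upper and lower 32-bit words of the dividend.
    Returns (quotient, remainder).
    """
    if not 0 < d <= MASK32:
        raise ValueError("divisor must be a nonzero 32-bit value")
    if not (0 <= hi <= MASK32 and 0 <= lo <= MASK32):
        raise ValueError("hi and lo must each fit in 32 bits")
    # Step 1: divide the high word; the remainder r is less than d.
    q_hi, r = divmod(hi, d)
    # Step 2: divide the two-word value r:lo by d. Because r < d,
    # this partial quotient is guaranteed to fit in 32 bits.
    q_lo, r = divmod((r << 32) | lo, d)
    return (q_hi << 32) | q_lo, r
```

The point is only that the work splits into two dependent division steps instead of one, which is where a native 64-bit divide would save time.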

Good ol' Cray-1 used an iterative process for floating point division.
Given a floating point number x, it used the first 8 bits of the
mantissa to index into a lookup table of initial guesses, then did a
few steps of Newton-Raphson iteration (multiply-add operations only) to
get the fully converged reciprocal mantissa, and fixed the exponent to
obtain 1/x.  Finally it multiplied y*(1/x) to get y/x.
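
The reciprocal scheme described above can be sketched roughly as follows. This is an illustration of the general technique on IEEE doubles, not the actual Cray-1 hardware algorithm. The table layout, the step count, and the names RECIP_TABLE, reciprocal and divide are all my own assumptions. It only handles finite, nonzero inputs in the normal range, and the result is not guaranteed to be correctly rounded in the last bit.

```python
import math

TABLE_BITS = 8
TABLE_SIZE = 1 << TABLE_BITS

# frexp gives a mantissa m in [0.5, 1). Slot i covers
# [0.5 + i/512, 0.5 + (i+1)/512) and stores 1/midpoint as the guess.
RECIP_TABLE = [1.0 / (0.5 + (i + 0.5) / (2 * TABLE_SIZE))
               for i in range(TABLE_SIZE)]

def reciprocal(x, steps=3):
    """Approximate 1/x by table lookup plus Newton-Raphson steps."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))        # abs(x) == m * 2**e, 0.5 <= m < 1
    # The leading mantissa bits select the initial guess.
    index = int((m - 0.5) * 2 * TABLE_SIZE)
    r = RECIP_TABLE[index]
    # Each step r <- r*(2 - m*r) uses only multiplies and adds and
    # roughly doubles the number of correct bits.
    for _ in range(steps):
        r = r * (2.0 - m * r)
    # Fix the exponent: 1/x = (1/m) * 2**(-e).
    return sign * math.ldexp(r, -e)

def divide(y, x):
    """Compute y/x as y * (1/x)."""
    return y * reciprocal(x)
```

With an 8-bit table the initial guess is good to about 9 bits, so three steps are enough to get past the 53 bits of a double.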

As I recall, the famous Pentium FDIV bug also came from an iterative,
table-driven divider (SRT digit recurrence rather than Newton-Raphson,
with a few lookup table entries missing), all of which is internal to
the floating point unit.  Moreover, in addition to following the 32/64-bit IEEE 754 
standard for floating point arithmetic, some implementations (e.g. 
Pentium, Opteron) support x87 legacy internal 80-bit representations of 
floating point numbers, which can really help when accumulating long 
sums and computing square roots, etc.  Prof. Kahan has numerous 
arguments in favor of this internal 80-bit representation...
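
To illustrate why a wider internal format helps with long sums, here is a small sketch by analogy. Python has no portable 80-bit type, so this assumes 32-bit operands with a 64-bit accumulator standing in for 64-bit operands with an 80-bit accumulator. The helper names to_f32, sum_narrow and sum_wide are made up for the example.

```python
import struct

def to_f32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def sum_narrow(values):
    """Accumulate with the running sum rounded to binary32 every step."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc

def sum_wide(values):
    """Accumulate in binary64 and round to binary32 only once at the end."""
    acc = 0.0
    for v in values:
        acc += v
    return to_f32(acc)

term = to_f32(0.1)
values = [term] * 100_000
exact = term * 100_000     # exact in binary64 for these inputs

err_narrow = abs(sum_narrow(values) - exact)
err_wide = abs(sum_wide(values) - exact)
print("narrow accumulator error:", err_narrow)
print("wide accumulator error:  ", err_wide)
```

The narrow accumulator loses a little on every addition once the running sum is much larger than the term being added, while the wide one only pays for a single rounding at the end.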

Sincerely,
Josip