# [Beowulf] AMD64 results...

Robert G. Brown rgb at phy.duke.edu
Sun Dec 19 07:43:18 PST 2004

```
On Thu, 16 Dec 2004, Josip Loncaric wrote:

> Robert G. Brown wrote:
> > [...]   One can see how having 64 bits would really
> > speed up 64 bit division compared to doing it in software across
> > multiple 32 bit registers...
>
> Correct me if I'm wrong, but doesn't the floating point unit normally
> use an internal iterative process to perform the division?  This would
> not involve 32-bit registers...
>
> I'm not so sure about *integer* 64-bit division.  Integer division may
> involve multiple 32-bit integer registers.
>
> Good ole' Cray-1 used an iterative process for floating point division
> which worked like this: given a floating point number x, use the first 8
> bits of the mantissa to index into a lookup table containing initial
> guesses, then do a few steps of Newton-Raphson iteration involving only
> multiply-add operations to get the fully converged reciprocal mantissa,
> fix the exponent, thus obtaining 1/x, then multiply y*(1/x) to get y/x.
>
> As I recall, the famous Pentium FDIV bug involved some corner cases in a
> similar iterative process, all of which is internal to the floating
> point unit.  Moreover, in addition to following the 32/64-bit IEEE 754
> standard for floating point arithmetic, some implementations (e.g.
> Pentium, Opteron) support x87 legacy internal 80-bit representations of
> floating point numbers, which can really help when accumulating long
> sums and computing square roots, etc.  Prof. Kahan has numerous
> arguments in favor of this internal 80-bit representation...

This may well be -- I used to hand code the 8087 back on the IBM PC and
thought that the 80 bit internal representation was peachy keen at the
time.  I haven't tracked precisely how the x87 coprocessor model has
evolved (legacy or not) into P6-class processors, though -- the mixing
of RISC, CISC, CISC-interpreted-to-RISC-onchip left me confused years
ago.

I was really just making an empirical observation, and struggling to
understand it.  As I pointed out yesterday, transcendental evals seem to
be much faster as well, which would certainly be consistent with a
resurrection of an efficient internal x87 architecture.  If so, I'm all
for it -- HPC code (at least MY HPC code:-) tends to have more than just
triad-like operations on vectors -- things like the trig functions,
exponentials and logs, floating point division.  I remember when my Sun
386i could turn in a savage that compared pretty well with the otherwise
much faster Sun 110 and Sparc 1 because it had a real CISC 80387 and Sun
was doing all of its transcendental calls in (RISC) software.

rgb

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

```
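
The table-plus-Newton-Raphson division scheme described in the quoted message can be sketched roughly as follows. This is only an illustration of the idea in Python doubles, not a model of the Cray-1 hardware: the function names `reciprocal` and `divide`, the way the table is built (midpoint of each interval), and the default of three iterations are all assumptions made for this example. Zero, infinities, NaN, and overflow of the result are not handled.

```python
import math

TABLE_BITS = 8  # the message mentions indexing with the first 8 mantissa bits

# Hypothetical lookup table of initial guesses for 1/m, with m in [0.5, 1).
# Entry i covers m in [0.5 + i/512, 0.5 + (i+1)/512); the guess is the
# reciprocal of the interval midpoint (an assumption, chosen for simplicity).
RECIP_TABLE = [
    1.0 / (0.5 + (i + 0.5) / 2 ** (TABLE_BITS + 1))
    for i in range(2 ** TABLE_BITS)
]


def reciprocal(x, steps=3):
    """Approximate 1/x using a table lookup plus Newton-Raphson refinement."""
    # Split |x| into mantissa and exponent: |x| = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(abs(x))
    # Use the leading mantissa bits to pick the initial guess.
    index = int((m - 0.5) * 2 ** (TABLE_BITS + 1))
    r = RECIP_TABLE[index]
    # Each step uses only multiplies and a subtract, and roughly doubles
    # the number of correct bits: r <- r * (2 - m * r).
    for _ in range(steps):
        r = r * (2.0 - m * r)
    # Fix the exponent: 1/|x| = (1/m) * 2**(-e), then restore the sign.
    result = math.ldexp(r, -e)
    return -result if x < 0 else result


def divide(y, x, steps=3):
    """Compute y/x as y * (1/x), as in the scheme described above."""
    return y * reciprocal(x, steps)
```

With an 8-bit table the initial guess is good to roughly 9 bits, so three doublings are enough to reach the 53-bit precision of a double; `divide(355.0, 113.0)` should agree with `355.0 / 113.0` to within a few units in the last place, though this sketch makes no claim of correct IEEE 754 rounding.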