[Beowulf] Performance penalty when using 128-bit reals on AMD64

Fri Jun 25 08:13:20 PDT 2010

I can certainly imagine a 2-8x slowdown: 4x for, say, multiplication (I believe AMD64 doesn't support quad precision in hardware, so everything has to be emulated) and 2x for the extra memory bandwidth. 32x seems harsh, but isn't obviously crazy.

This sounds a lot like blindly using a sledgehammer, though. If the user absolutely requires quad precision everywhere because they need better than one part in 1e16 throughout their calculation, then they're basically just doomed; but there are very few applications in that regime. More likely there's some part of their problem which is particularly sensitive to the numerics (or they're just using crappy numerics everywhere).

One nice thing about the flurry of GPGPU activity is that it's inspired a resurgence of interest in "mixed precision algorithms", where parts of the numerics are implemented (or emulated) at very high precision and others at lower precision. It might be worth googling around a bit to see if people have already implemented that sort of approach for their particular problem.
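
To make the idea concrete, here is a minimal sketch of one common building block, compensated summation. The data stays in ordinary doubles, and only the accumulation carries an extra double holding the rounding error. The function names and the test data are made up for illustration; this is not taken from any particular package, and whether it helps depends on where the user's code is actually sensitive.

```python
import math

def two_sum(a, b):
    """Error-free transformation: s is the rounded sum, e the rounding error,
    so that s + e equals a + b exactly (barring overflow)."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def naive_sum(values):
    """Plain double-precision accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def compensated_sum(values):
    """Double-precision accumulation that also tracks the lost low-order bits."""
    total = 0.0
    error = 0.0
    for v in values:
        total, e = two_sum(total, v)
        error += e
    return total + error

# Illustrative data with heavy cancellation; the exact sum is 2.0.
data = [1.0e16, 1.0, -1.0e16, 1.0]

print(naive_sum(data))        # 1.0 (one of the small terms is lost)
print(compensated_sum(data))  # 2.0
print(math.fsum(data))        # 2.0 (correctly rounded reference from the stdlib)
```

The cost is a handful of extra double-precision operations per element, with no change to the storage format and therefore no extra memory bandwidth.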

Of course, if they really, really need quad precision everywhere, they should find an architecture (the Power series) that supports quad precision in hardware; but they'll always end up paying the 2x memory bandwidth penalty, no way around that.

GNU GMP, which is very cute and well implemented, is definitely not a way to make things go *faster*. It may well be faster than the other arbitrary-precision libraries out there, but I would expect it to be slower than (fixed) quad precision. On the other hand, if only a small portion of the code needs arbitrary precision and the rest can be done in double, there may not be a huge speed penalty.
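
As a rough illustration of confining software multiprecision to the one sensitive step, the sketch below evaluates a cancellation-prone discriminant at 40 digits and hands a double back to the rest of the code. Python's standard `decimal` module stands in for GMP here purely for convenience; the function names, the choice of 40 digits, and the test values are all assumptions made for the example.

```python
from decimal import Decimal, localcontext

def discriminant_double(a, b, c):
    """b*b - 4*a*c evaluated entirely in double precision."""
    return b * b - 4.0 * a * c

def discriminant_high_precision(a, b, c, digits=40):
    """Same quantity, with only this step done in software multiprecision.
    Inputs and the returned value are ordinary doubles."""
    with localcontext() as ctx:
        ctx.prec = digits
        # Decimal(float) converts the double exactly, without rounding.
        a_hp, b_hp, c_hp = Decimal(a), Decimal(b), Decimal(c)
        return float(b_hp * b_hp - 4 * a_hp * c_hp)

# Illustrative inputs: b*b and 4*a*c agree in their leading digits.
a, c = 1.0, 1.0
b = 2.0 + 2.0**-30

exact = 2.0**-28 + 2.0**-60   # true value of b*b - 4*a*c, representable as a double

print(discriminant_double(a, b, c) == 2.0**-28)        # True: the 2**-60 term is lost
print(discriminant_high_precision(a, b, c) == exact)   # True
```

Everything outside the `with` block still runs at full double-precision speed, so the overall penalty depends on how often the sensitive step is called.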

     Jonathan

-- 
Jonathan Dursi <ljdursi at scinet.utoronto.ca>