What do you guys think about the P4

Don Holmgren djholm at fnal.gov
Thu Apr 5 08:43:24 PDT 2001

I recently built a 1.4 GHz P4 machine, using the Intel D850GB
motherboard.  Total cost was about $1100, including 128 MB of 800 MHz
RDRAM (included with the boxed processor from Intel). 

Benchmarks on our code (lattice gauge QCD) have been very impressive.  I
have a summary graph at
which shows relative performance of the P4, various Alphas (ev6 and
ev67), and Katmai and Coppermine Pentium III's.  Running out of main
memory on lattices of at least 10 MB, the P4 outperforms all of the
other chips.  The code (from the MILC collaboration) is floating point

I've also recently started hand coding the low level math kernels
(su3 matrix and vector operations - basically 3x3 complex matrices and
3x1 complex vectors) using SSE.  On both P-III and P4, this gives a
~30% boost to performance on this particular MILC code.
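
For readers who have not seen these kernels, here is a minimal scalar
sketch in C of the kind of operation involved: a 3x3 complex matrix
times a 3x1 complex vector. The type and function names (cplx, mat3,
vec3, mat3_mul_vec3) are made up for illustration and are not taken
from the MILC source, and this is plain C rather than the hand-coded
SSE version, which is not shown in this post.

```c
/* Illustrative types only; names are hypothetical, not MILC's. */
typedef struct { float re, im; } cplx;
typedef struct { cplx e[3][3]; } mat3;   /* 3x3 complex matrix */
typedef struct { cplx c[3]; } vec3;      /* 3x1 complex vector */

/* out = a * b.  out must not alias b.
   Each output element is a sum of three complex products. */
void mat3_mul_vec3(const mat3 *a, const vec3 *b, vec3 *out)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            /* (ar + i*ai) * (br + i*bi) accumulated into (re, im) */
            re += a->e[i][j].re * b->c[j].re - a->e[i][j].im * b->c[j].im;
            im += a->e[i][j].re * b->c[j].im + a->e[i][j].im * b->c[j].re;
        }
        out->c[i].re = re;
        out->c[i].im = im;
    }
}
```

The inner loop is a short, fixed-size run of multiplies and adds on
adjacent floats, which is the sort of pattern that packed
single-precision SSE instructions are designed for.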

STREAM numbers on P4's are impressive.  Using a compiler that
understands SSE instructions gives even better results.  Here's the
output with a gcc build:

  Function      Rate (MB/s)   RMS time     Min time     Max time
  Copy:        1324.0370       0.0492       0.0483       0.0556
  Scale:       1336.4782       0.0487       0.0479       0.0552
  Add:         1556.6983       0.0621       0.0617       0.0623
  Triad:       1541.3021       0.0627       0.0623       0.0628

With a Portland Group compiler (pgcc) build (-Mvect=sse), I get:

  Function      Rate (MB/s)   RMS time     Min time     Max time
  Copy:        2072.0057       0.0309       0.0309       0.0311
  Scale:       1395.3079       0.0463       0.0459       0.0464
  Add:         1907.2235       0.0505       0.0503       0.0509
  Triad:       1889.2441       0.0509       0.0508       0.0513

After playing with this a bit, it appears that most of this boost comes
from pgcc's use of the SSE prefetch instructions; some benefit also
comes from moving data via the 128-bit wide SSE registers.
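
As a reading aid for the tables above, here is a small C sketch of the
Triad kernel and of how a STREAM-style rate is derived from a timing:
bytes moved divided by time, with 1 MB taken as 1e6 bytes. The function
names are hypothetical. The array length is not stated in this post;
the reported rates and min times appear consistent with about 4,000,000
doubles per array, but treat that as an inference.

```c
#include <stddef.h>

/* Triad as defined by the STREAM benchmark: a[i] = b[i] + q * c[i]. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Rate in MB/s (1 MB = 1e6 bytes) for one pass over the arrays.
   arrays_touched is 2 for Copy and Scale, 3 for Add and Triad. */
double stream_rate_mbs(size_t n, int arrays_touched, double seconds)
{
    double bytes = (double)arrays_touched * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

For example, stream_rate_mbs(4000000, 3, 0.0623) comes out near the
Triad rate in the gcc table, under the array-size assumption above.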

Some problems with P4's:
- no motherboard (I think) yet supports 64/66 PCI.  It's a shame to
  finally have a memory bus capable of supporting very high rate I/O,
  only to be squeezed down by a 32/33 PCI bottleneck.
- SSE support requires a kernel patch for the 2.2.x kernels.  gcc has no
  specific support (I believe) for SSE, though adding macros to
  implement the prefetch instructions is easy, and I'd much rather do
  the hand assembly coding in NASM.  pgcc will use SSE instructions and
  attempt to vectorize.
- No dual P4 motherboards until P4 Xeons are released.
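
On the prefetch macros mentioned in the list above, here is a rough
sketch of what such a macro might look like with gcc-style inline
assembly. The names PREFETCH_NTA, PREFETCH_DISTANCE and
sum_with_prefetch are invented for this example, the prefetch distance
is a guess that would need tuning, and the fallback branch uses
__builtin_prefetch, which exists only in later gcc releases. This is
not the macro set or the NASM code referred to in this post.

```c
#include <stddef.h>

/* Hypothetical macro: emit the SSE prefetchnta instruction on x86,
   fall back to a compiler builtin or a no-op elsewhere. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define PREFETCH_NTA(addr) \
    __asm__ __volatile__("prefetchnta %0" : : "m"(*(const char *)(addr)))
#elif defined(__GNUC__)
#define PREFETCH_NTA(addr) __builtin_prefetch((addr), 0, 0)
#else
#define PREFETCH_NTA(addr) ((void)0)
#endif

/* Elements to prefetch ahead of the current one; an untuned guess. */
#define PREFETCH_DISTANCE 64

/* Sum an array while hinting the cache about data needed soon.
   The bounds check keeps the prefetch address inside the array. */
double sum_with_prefetch(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_NTA(&x[i + PREFETCH_DISTANCE]);
        sum += x[i];
    }
    return sum;
}
```

A prefetch is only a hint, so the result is the same with or without
it; whether it helps depends on the access pattern and the distance
chosen.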

Still, for our code, P4's are excellent.  I'm hoping to expand
our 80-node dual PIII cluster with dual P4 Xeons (if the price is right -
let the best hardware win!) this summer.

Don Holmgren
djholm at fnal.gov
