What do you guys think about the P4
Don Holmgren
djholm at fnal.gov
Thu Apr 5 08:43:24 PDT 2001
I recently built a 1.4 GHz P4 machine, using the Intel D850GB
motherboard. Total cost was about $1100, including 128 MB of 800 MHz
RDRAM (included with the boxed processor from Intel).
Benchmarks on our code (lattice gauge QCD) have been very impressive. I
have a summary graph at
http://qcdhome.fnal.gov/cluster_design/benchmarks.html
which shows the relative performance of the P4, various Alphas (ev6 and
ev67), and Katmai and Coppermine Pentium III's. When running from main
memory rather than cache (lattices of at least 10 MB), the P4 outperforms
all of the other chips. The code (from the MILC collaboration) is floating
point intensive.
I've also recently started hand coding the low level math kernels
(su3 matrix and vector operations - basically 3x3 complex matrices and
3x1 complex vectors) using SSE. On both P-III and P4, this gives a
~30% boost to performance on this particular MILC code.
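For anyone who has not seen these kernels, here is a minimal scalar sketch in C of the matrix-vector product that such an SSE routine would have to reproduce. The type and function names (cplx, su3_mat, su3_vec, su3_mat_vec) are made up for illustration and are not the MILC definitions, and single precision is an assumption.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not the MILC definitions.
   Single precision is an assumption. */
typedef struct { float re, im; } cplx;
typedef struct { cplx e[3][3]; } su3_mat;   /* 3x3 complex matrix */
typedef struct { cplx c[3]; } su3_vec;      /* 3x1 complex vector */

/* dst = m * src, done as nine complex multiply-accumulates.
   dst must not alias src. */
void su3_mat_vec(const su3_mat *m, const su3_vec *src, su3_vec *dst)
{
    for (size_t i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (size_t j = 0; j < 3; j++) {
            const cplx a = m->e[i][j];
            const cplx b = src->c[j];
            re += a.re * b.re - a.im * b.im;   /* real part of a*b */
            im += a.re * b.im + a.im * b.re;   /* imaginary part of a*b */
        }
        dst->c[i].re = re;
        dst->c[i].im = im;
    }
}
```

A hand-coded version would compute the same nine products, just with several of the real multiplies packed into each SSE instruction; a scalar routine like this is handy as a reference to check the assembly against.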
STREAM numbers on P4's are impressive. Using a compiler that
understands SSE instructions gives even better results. Here's the
output with a gcc build:
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          1324.0370     0.0492     0.0483     0.0556
Scale:         1336.4782     0.0487     0.0479     0.0552
Add:           1556.6983     0.0621     0.0617     0.0623
Triad:         1541.3021     0.0627     0.0623     0.0628
With a Portland Group compiler (pgcc) build (-Mvect=sse), I get:
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          2072.0057     0.0309     0.0309     0.0311
Scale:         1395.3079     0.0463     0.0459     0.0464
Add:           1907.2235     0.0505     0.0503     0.0509
Triad:         1889.2441     0.0509     0.0508     0.0513
From a bit of experimenting, most of this boost comes from pgcc's use of
the SSE prefetch instructions; some additional benefit comes from moving
data via the 128-bit wide SSE registers.
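To make the prefetch idea concrete, here is a rough sketch of a STREAM-style triad loop with software prefetch added. It uses __builtin_prefetch, which is available in current GCC and Clang; I am not claiming this is what pgcc generates. The function name triad_prefetch and the prefetch distance PF_DIST are hypothetical, and the distance is an untuned guess.

```c
#include <stddef.h>

/* Prefetch distance in elements; a hypothetical tuning knob,
   not a measured value. */
#define PF_DIST 64

/* STREAM-style triad: a[i] = b[i] + q * c[i], with read prefetch
   issued PF_DIST elements ahead on both source arrays. */
void triad_prefetch(double *a, const double *b, const double *c,
                    double q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* Args: address, 0 = read access, 0 = low temporal locality. */
            __builtin_prefetch(&b[i + PF_DIST], 0, 0);
            __builtin_prefetch(&c[i + PF_DIST], 0, 0);
        }
        a[i] = b[i] + q * c[i];
    }
}
```

This issues a prefetch for every element, which is wasteful; a tuned loop would issue one per cache line and pick the distance by measurement on the target machine.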
Some problems with P4's:
- no motherboard (I think) yet supports 64-bit/66 MHz PCI. It's a shame
to finally have a memory bus capable of supporting very high rate I/O,
only to be squeezed down by a 32-bit/33 MHz PCI bottleneck.
- SSE support requires a kernel patch for the 2.2.x kernels. gcc has no
specific support (I believe) for SSE, though adding macros to
implement the prefetch instructions is easy, and I'd much rather do
the hand assembly coding in NASM. pgcc will use SSE instructions and
attempt to vectorize.
- No dual P4 motherboards until P4 Xeons are released.
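As an illustration of the prefetch-macro approach mentioned in the list above, one way to do it with GCC-style inline assembly might look like the sketch below. The macro name PREFETCH_NTA, the function copy_prefetch, and the 32-element distance are all hypothetical; on anything other than x86 with GCC or Clang the macro is written to expand to a no-op.

```c
#include <stddef.h>

/* Hypothetical macro name. Emits the SSE prefetchnta instruction on x86
   when built with GCC or Clang; expands to a no-op elsewhere. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#define PREFETCH_NTA(addr) \
    __asm__ __volatile__("prefetchnta %0" : : "m"(*(const char *)(addr)))
#else
#define PREFETCH_NTA(addr) ((void)0)
#endif

/* Copy n doubles, hinting the source data a fixed distance ahead. */
void copy_prefetch(double *dst, const double *src, size_t n)
{
    const size_t dist = 32;   /* elements ahead; assumed, untuned */
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            PREFETCH_NTA(&src[i + dist]);
        dst[i] = src[i];
    }
}
```

The "m" constraint hands the assembler a memory operand for the address, so the compiler never loads from it; the instruction is only a hint and the copy gives the same result with or without it.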
Still, for our code, P4's are excellent. I'm hoping to expand
our 80-node dual PIII cluster with dual P4 Xeons (if the price is right -
let the best hardware win!) this summer.
Don Holmgren
Fermilab
djholm at fnal.gov