What do you guys think about the P4
Josh Fryman
fryman at cc.gatech.edu
Wed Apr 4 21:10:16 PDT 2001
i'm not familiar with your specific tests, but i can address the questions
you asked. loosely paraphrased: the p4 architecture (insert "CPU XYZ" for
P4 here as well) sometimes simply sucks running non-p4 optimized code.
why?
(deep) pipelines require compiler tricks to reschedule code to hide stalls
and latencies. the p4 has a pipeline 2x as deep as the p3's. every time a
stall would have occurred and was hidden in your p3 code, you now kill
about 10 cycles waiting for the 20-stage pipeline to be ready. think how
often you do something that induces stalls - like data-dependent
instructions, where one result feeds straight into the next:
a = b + c
b = a * 2
and things like loops/branches which induce other problems with similar
consequences. also, realize that the time spent doing arithmetic ops
is dependent on the type of op you ask for. sometimes (well, it used to
be the case in the bad old days) a bunch of ADDs would run much faster
than one MUL. so if the compiler doesn't know which ops are fastest for a
given program segment on a given cpu, it may choose poorly through
ignorance. now think about those all-new bulk-move-memory instructions,
and some of those vector-multiply instructions, etc. that the P4
introduced. how many chunks of tens of instructions can be replaced by
1 instruction that executes in some fraction of the time?
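to make the ADD-vs-MUL point concrete, here is a minimal sketch in C of
replacing a multiply by a constant with shifts and an add. the function
names are made up for illustration, and whether the second form is any
faster depends entirely on the cpu and compiler (most current compilers
will pick between the two forms themselves), so treat it as a picture of
the idea rather than a recommendation.

```c
#include <stdint.h>

/* the obvious version: one multiply */
uint32_t times10_mul(uint32_t x)
{
    return x * 10u;
}

/* same result built from cheaper ops: 10*x = 8*x + 2*x.
 * unsigned arithmetic wraps modulo 2^32 in both versions,
 * so the two functions agree for every input. */
uint32_t times10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

both functions return the same value for any 32-bit input; which one wins
on a given chip is something you have to measure.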
every time a new CPU comes out with new architectural features, it takes
a while for everything to get recompiled. once it's recompiled, and people
find it's still slow, they back up and look at why. a little profiling
shows that the compiler isn't smart enough to unroll *that* one loop 10
times, so you have to do it by hand in the code. then it gives good results.
this is the way everything gets done... optimizing compilers with specific
architecture support are good, but can't beat humans.
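as a rough illustration of unrolling by hand, here is a sketch in C of a
summation loop unrolled 4x with separate accumulators, so consecutive adds
no longer wait on each other. the function names and the unroll factor of 4
are assumptions picked for the example (not taken from any particular code
base), and the right factor for a given cpu has to be found by measuring.

```c
#include <stddef.h>

/* straightforward version: every add depends on the previous one */
double sum_simple(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* unrolled 4x by hand, with four independent accumulators so the
 * adds inside one iteration do not depend on each other */
double sum_unrolled4(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    double s = (s0 + s1) + (s2 + s3);

    /* leftover elements when n is not a multiple of 4 */
    for (; i < n; i++)
        s += x[i];
    return s;
}
```

note that the unrolled version adds the elements in a different order, so
for general floating-point data the result can differ in the last bits.
that reordering is one reason a compiler may refuse to do this on its own
unless you tell it that's acceptable.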
a simple example. when the p4 first sampled to hardware groups for eval,
everyone benchmarked things like MPEG2 and MPEG3 encode/decode and such.
they found that the P4 at 1.4G was SLOWER than a P3 at 800M. why? b/c of
the pipeline stalls and assembly optimizations that were meant for a
different CPU's internals.
a couple of intel engineers, without doing a full optimization, made a
couple of tweaks to the ASM and recompiled with a P4-aware compiler and
it kicked the pants off everything around. they made their changes
public. the hardware sites then invited AMD to optimize their code for
their chips and do a new comparison, but AMD has thus far declined to
do so. why? probably because they know it won't help much if at all.
don't get me wrong - AMD makes good stuff. but just because a whiz-bang
processor comes out doesn't mean you can do whiz-bang stuff right away.
and if you're running an app that is *performance* bound, then it's
*your* job to figure out how to make it run as fast as possible on the
hardware you've got, and optimize it accordingly. never trust a
compiler to do it right. you can easily get 50-500% improvements in
your programs if you know how to do this. the same thing applies to
partitioning the communication latencies in parallel apps. but as
always, this is application-specific and YMMV.
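since all of this only pays off if you measure before and after, here is a
minimal timing sketch in C using the standard clock() call. busy_work() is
a hypothetical stand-in for whatever routine you are tuning, and
seconds_per_call() is a made-up helper name. clock() reports cpu time at
fairly coarse resolution, so use plenty of repetitions.

```c
#include <time.h>

/* written to at the end of busy_work() so the loop is not optimized away */
static volatile unsigned long sink;

/* hypothetical stand-in for the routine being tuned */
void busy_work(void)
{
    unsigned long s = 0;
    for (unsigned long i = 0; i < 1000UL; i++)
        s += i;
    sink = s;
}

/* average cpu seconds per call of fn over reps calls (reps must be > 0) */
double seconds_per_call(void (*fn)(void), int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

calling seconds_per_call(busy_work, 100000) before and after a change gives
a crude comparison. a real profiler will tell you a lot more about where the
time actually goes.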
all of this is thoroughly discussed in Hennessy & Patterson's books.
specifically, Computer Architecture: A Quantitative Approach. (well, by
thoroughly i mean that it's explained in detail as a general problem;
the specifics of the P4 you'd have to read up on from Intel and various
places like "tom's hardware" and such.) this is *the* architecture
reference book, a must-have for you if you need to do/know this stuff.
also see "parallel computer architecture" ... don't recall the authors
at the moment. ask if you want to know.
More information about the Beowulf mailing list