Fortran compilers for Linux/mpich

Robert G. Brown rgb at
Sun Nov 25 08:32:25 PST 2001

On Fri, 23 Nov 2001, Don Holmgren wrote:

> At the very bottom of the page,
> I have a table with cycle counts posted for a number of matrix-matrix
> and matrix-vector routines as measured on a P-III (Coppermine), P4, and
> an Athlon MP.  Times are posted for both a pure-C version of each
> routine, built with gcc, as well as for an SSE version.  The sources
> for each are available at
> The results are a mixed bag, with each flavor processor sometimes first,
> second, or third.  I'm using only a small subset of SSE - mostly shufps,
> addps, mulps, with a few xorps, movaps, and movups thrown in.  I haven't
> timed individual instructions on all three processors.
> Don Holmgren
> Fermilab

Awesomely useful, Don, thanks.

Do you have any idea what the overall marginal benefit is of using your
hand-optimized routines when working on large datasets (too big to fit
into cache)?  In particular, does performance devolve to
memory-bandwidth-bound behavior (and hence end up being the same for
MILC and SSE and dominated by the memory bus speed)?
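
One rough way to probe this question is to time the same kernel per site
at a working-set size that fits in cache and at one that does not.  The
sketch below is only an illustration under stated assumptions: the names
matvec3_sites, bytes_per_site, and time_per_site are hypothetical, the
kernel is a plain real 3x3 matrix times 3-vector loop rather than any of
Don's actual routines, and clock() is a coarse timer, not a cycle counter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical kernel (not Don's code): y[i] = A[i] * x[i] for n
   independent real 3x3 matrices and 3-vectors stored contiguously. */
void matvec3_sites(const float *a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const float *m = a + 9 * i;   /* row-major 3x3 matrix for site i */
        const float *v = x + 3 * i;   /* input vector for site i */
        float *r = y + 3 * i;         /* output vector for site i */
        for (int row = 0; row < 3; row++)
            r[row] = m[3 * row] * v[0]
                   + m[3 * row + 1] * v[1]
                   + m[3 * row + 2] * v[2];
    }
}

/* Memory traffic per site: 9 + 3 floats read, 3 floats written.
   Against 15 flops per site (9 multiplies, 6 adds), that is one flop
   per float moved, so a streaming run is likely to be bus-limited. */
size_t bytes_per_site(void)
{
    return (9 + 3 + 3) * sizeof(float);
}

/* Average CPU seconds per site over `reps` sweeps of n sites.
   Returns -1.0 if allocation fails. */
double time_per_site(size_t n, int reps)
{
    float *a = malloc(9 * n * sizeof *a);
    float *x = malloc(3 * n * sizeof *x);
    float *y = malloc(3 * n * sizeof *y);
    if (!a || !x || !y) {
        free(a); free(x); free(y);
        return -1.0;
    }
    for (size_t i = 0; i < 9 * n; i++) a[i] = 1.0f;
    for (size_t i = 0; i < 3 * n; i++) x[i] = 1.0f;

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        matvec3_sites(a, x, y, n);
    clock_t t1 = clock();

    volatile float sink = y[0];   /* keep the result observably used */
    (void)sink;

    free(a); free(x); free(y);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / ((double)n * reps);
}
```

Comparing time_per_site() for a small n with many reps against a large n
with few reps (keeping n * reps roughly constant) gives a crude in-cache
versus out-of-cache ratio.  If that ratio comes out about the same for
the C and SSE builds at large n, the memory bus is the likely limit.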


Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at

More information about the Beowulf mailing list