[Beowulf] HPL Benchmarking and Optimization
Joshua mora acosta joshua_mora at usa.net
Wed Apr 2 21:07:30 PDT 2008
For AMD-based systems, get ACML and gcc, pgi or pathscale.
For Intel-based systems, get MKL and the Intel compiler.
Run an N (problem size) of around 90% workload, that is, a 1.8GB per core memory footprint.
Run NB 192 on AMD. I don't know the best blocking factor for MKL; I've tried the same 192 and it does fairly well.
Set affinity for the MPI processes, even on 1-socket runs.
Run PxQ 2x2, 2x4, 4x4, ... depending on the number of cores.
With the above you should get at least 77% efficiency on both AMD and Intel.
As suggested by Tom, the Goto library will give you good performance as well.
You can also try the multithreaded version: use PxQ=1x1 and OMP_NUM_THREADS=4 for a single-socket quadcore.
Reduce TLB misses with huge pages.

If you get below 75% efficiency, you are doing something wrong.
If you do more than 85% on quadcore, please let me know :)

Regards,
Joshua

------ Original Message ------
Received: Wed, 02 Apr 2008 12:33:25 PM PDT
From: Ellis Wilson <xclski at yahoo.com>
To: beowulf at beowulf.org
Subject: Re: [Beowulf] HPL Benchmarking and Optimization

> Ellis Wilson wrote:
> > Currently I get these kinds of numbers from tested computers using the
> > same environment (gentoo, fortran in gcc, hpl, all same compilation
> > options):
> > 1 x Core2Duo (2.1ghz/core, 2gigs ram) - 2.3Gflops
> > 1 x Athlon 64 3500+ (2.2ghz, 1gig ram) - 1.0Gflops
> > 4 x Core2Duo (2.1ghz/core for a total of 8 cores, 2gigs ram/node,
> > 100mbit Ethernet interconnect) - 6.7Gflops
>
> Sorry to double post all, however, I realized my issue involved running
> HPL on the reference library of BLAS that is generic for every
> architecture and didn't want to waste anyone's time. Giving Portage the
> benefit of the doubt, I had failed to check that its dependencies were
> best for HPL.
> Following an install of ATLAS and relinking to its
> libraries, I've gotten the following numbers:
> 1 x Athlon64 3500+ (2.2ghz, 1gig ram) - 3.6GFlops
> 1 x Phenom9600 Quadcore (2.3ghz/core, 2gigs ram) - 11.9GFlops
>
> I'll likely try MKL soon for the Intel processors I'm interested in.
>
> The phenom9600 had previously only gotten 4.5 GFlops, and when I tested
> it the second time I simply used the same environment I had compiled for
> the athlon64. Certainly compiling ATLAS native on the phenom will
> increase the result, hopefully about 350% like with the athlon64 (though
> I suspect things will be interesting due to bandwidth, etc for quadcores).
>
> Anyway, not to end the thread I still am wondering:
>
> Do those of you who have professional installations or even simply large
> setups that are unsure of the exact code which will be run upon your
> cluster utilize compilation options such as -O3, -funroll-loops,
> -fomit-frame-pointer, etc?
>
> Thanks,
>
> Ellis
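
The sizing advice in the reply above (N at roughly 90% of memory, NB = 192, a near-square PxQ grid, and an efficiency target) can be turned into a few lines of arithmetic. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the matrix is assumed to be N x N double-precision (8 bytes per element), and the flops-per-cycle figure used for theoretical peak is an input you must supply for your own CPU; it is not given anywhere in this thread.

```python
import math


def hpl_problem_size(total_mem_bytes, fraction=0.9, nb=192):
    """Largest N that is a multiple of nb such that an N x N matrix of
    8-byte doubles fits in `fraction` of the total memory (assumption:
    the matrix dominates HPL's memory footprint)."""
    max_elements = int(fraction * total_mem_bytes) // 8
    n = math.isqrt(max_elements)
    return (n // nb) * nb


def process_grid(ranks):
    """Most nearly square P x Q grid with P <= Q and P * Q == ranks."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p


def efficiency(measured_gflops, cores, ghz, flops_per_cycle):
    """Measured performance as a fraction of theoretical peak.
    flops_per_cycle is per core and is an assumed, CPU-specific value."""
    peak_gflops = cores * ghz * flops_per_cycle
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    # Hypothetical node: 2 GiB of RAM in total, 4 MPI ranks.
    print("N  =", hpl_problem_size(2 * 1024**3))
    print("PxQ =", process_grid(4))
    # Assumes 4 double-precision flops per cycle per core.
    print("eff =", round(efficiency(11.9, cores=4, ghz=2.3, flops_per_cycle=4), 3))
```

If one assumes 4 double-precision flops per cycle per core, the Phenom 9600 result quoted above (11.9 GFlops on 4 cores at 2.3 GHz) comes out at roughly 32% of peak, well under the 75% threshold mentioned in the reply. Rounding N down to a multiple of NB is a common convention rather than something the thread requires.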
