[Beowulf] trouble running the linpack xhpl benchmark
Craig Tierney
ctierney at hypermall.net
Fri May 5 14:24:11 PDT 2006
Bruce Allen wrote:
> I've built three other large clusters in the past, but was never
> motivated to do a Top500 linpack benchmark for them. This time around,
> for our new Nemo cluster, I want to have linpack results for the Top500
> list. So Kipp Cannon, one of our group's postdocs, has spent a few days
> setting up and running linpack/xhpl.
>
> We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB per node and
> a good gigE network.
>
> We're having problems getting xhpl to run on the entire cluster, and are
> wondering if someone on this list might have insight into what might be
> going wrong. At the moment, the software combination is gcc + lam/mpi +
> atlas + hpl. Note that in our normal use the cluster runs standalone
> executables managed via Condor (trivially parallel code!), so this is
> our first use of MPI in at least three years.
Use Goto's BLAS library; it is faster than ATLAS.
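Switching is mostly a matter of pointing the BLAS lines in your HPL
Make.<arch> file at Goto's library instead of ATLAS and rebuilding xhpl.
Something like the excerpt below; the install path and library name are
only placeholders for wherever you built GotoBLAS:

  # Make.<arch> (excerpt): BLAS section only; path and name are examples.
  LAdir        = /opt/gotoblas
  LAinc        =
  LAlib        = $(LAdir)/libgoto.a -lpthread

Goto's library usually wants -lpthread on the link line even if you run
it single-threaded per process.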
>
> Testing on up to 338 nodes (676 cores), the benchmark runs fine and we
> are getting above 60% of peak floating-point performance. But,
> attempting to use the entire cluster (640 nodes, 1280 cores) seems to
> trigger the out-of-memory killer on some nodes. The jobs never really
> seem to start running; they are killed before calling mpi_init (which
> matches the error message we see from LAM: "job exited before calling
> mpi_init()").
>
> The jobs die very quickly, so we have not been able to see how much
> memory they try to allocate. We are using a spreadsheet given to us by
> David Cownie at AMD for calculating the problem size based on the
> maximum usable RAM per core, and have found that the spreadsheet works
> correctly: runs on 20, 196, and 676 cores with problem sizes chosen by
> the spreadsheet show the predicted RAM use per core in all cases.
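For what it is worth, the spreadsheet should be doing roughly
N ~ sqrt(0.8 * total_memory_in_bytes / 8), rounded down to a multiple
of NB. A quick cross-check with made-up numbers for a 640-node, 2 GB
system (the 0.8 fraction and the NB below are guesses, not your actual
settings):

  /* rough_n.c: rough HPL problem-size estimate, N ~ sqrt(f*M/8).
   * M is the total memory handed to the A matrix (8 bytes per double),
   * f the fraction left after the OS, LAM daemons, and MPI buffers.
   * Only a sanity check on the spreadsheet, not a replacement for it. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double nodes    = 640.0;     /* node count                      */
      double mem_node = 2.0e9;     /* ~2 GB per node                  */
      double fraction = 0.80;      /* usable fraction -- a guess      */
      long   nb       = 232;       /* block size -- also just a guess */

      double n = sqrt(fraction * nodes * mem_node / 8.0);
      long   N = (long)(n / nb) * nb;  /* round down to a multiple of NB */

      printf("suggested N ~ %ld\n", N);  /* about 357,000 here */
      return 0;
  }

If the number the spreadsheet gives you for 1280 cores is much bigger
than that, the OOM kills would not be a surprise.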
>
> Could there be some threshold in xhpl, where above some problem size
> its RAM usage increases for other reasons?
>
> What about the "PxQ" parameters? For 676 cores we are using a square
> P=Q grid, but we have to change this to use 1280 cores. Does anyone
> know of problems with running xhpl when P != Q on x86_64?
Have you tried running xhpl on both halves of the system? This will
tell you if you have hardware problems on one side of the system.
Also, try setting N to a small number, like 10000, for the entire
cluster; you can start to isolate the problem that way as well. Just
make sure that P < Q and keep the grid as close to square as possible;
32x40 should work well for your system.
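For reference, the relevant lines in HPL.dat would look something like
the excerpt below for the full machine. The Ns value shown is the small
smoke-test size; swap in the spreadsheet's number once the job survives
start-up on all 640 nodes, and use whatever NB you have tuned for your
BLAS:

  1            # of problem sizes (N)
  10000        Ns
  1            # of NBs
  232          NBs
  ...
  1            # of process grids (P x Q)
  32           Ps
  40           Qs

With Ps=32 and Qs=40 the grid covers all 1280 cores, and if a tiny N
gets through MPI_Init everywhere you can be fairly sure the OOM kills
are tied to the problem size rather than to xhpl start-up itself.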
Craig
>
> Cheers,
> Bruce Allen
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf