[Beowulf] trouble running the linpack xhpl benchmark
Craig Tierney
ctierney at hypermall.net
Fri May 5 14:24:11 PDT 2006
Bruce Allen wrote:
> I've built three other large clusters in the past, but was never
> motivated to do a Top500 linpack benchmark for them. This time around,
> for our new Nemo cluster, I want to have linpack results for the Top500
> list. So Kipp Cannon, one of our group's postdocs, has spent a few days
> setting up and running linpack/xhpl.
>
> We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB per node and
> a good gigE network.
>
> We're having problems getting xhpl to run on the entire cluster, and are
> wondering if someone on this list might have insight into what might be
> going wrong. At the moment, the software combination is gcc + lam/mpi +
> atlas + hpl. Note that in our normal use the cluster runs standalone
> executables managed via Condor (trivially parallel code!), so this is
> our first use of MPI in at least three years.
Use Goto's BLAS library; it is faster than ATLAS.
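Switching is mostly a matter of pointing the BLAS lines in your HPL
Make.<arch> file at Goto's library instead of ATLAS and rebuilding xhpl.
Something like the excerpt below; the install path and library name are
only placeholders for wherever you built GotoBLAS:

  # Make.<arch> (excerpt): BLAS section only; path and name are examples.
  LAdir        = /opt/gotoblas
  LAinc        =
  LAlib        = $(LAdir)/libgoto.a -lpthread

Goto's library usually wants -lpthread on the link line even if you run
it single-threaded per process.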
>
> Testing on up to 338 nodes (676 cores), the benchmark runs fine and we
> are getting above 60% of peak floating-point performance. But,
> attempting to use the entire cluster (640 nodes, 1280 cores) seems to
> trigger the out-of-memory killer on some nodes. The jobs never really
> seem to start running; they are killed before calling mpi_init (which
> matches the error message we see from LAM: "job exited before calling
> mpi_init()").
>
> The jobs die very quickly, so we have not been able to see how much
> memory they try to allocate. We are using a spreadsheet given to us by
> David Cownie at AMD for calculating the problem size based on the
> maximum usable RAM per core, and have found that the spreadsheet works
> correctly: runs on 20, 196, and 676 cores with problem sizes chosen by
> the spreadsheet show the predicted RAM use per core in all cases.
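For what it is worth, the spreadsheet should be doing roughly
N ~ sqrt(0.8 * total_memory_in_bytes / 8), rounded down to a multiple
of NB. A quick cross-check with made-up numbers for a 640-node, 2 GB
system (the 0.8 fraction and the NB below are guesses, not your actual
settings):

  /* rough_n.c: rough HPL problem-size estimate, N ~ sqrt(f*M/8).
   * M is the total memory handed to the A matrix (8 bytes per double),
   * f the fraction left after the OS, LAM daemons, and MPI buffers.
   * Only a sanity check on the spreadsheet, not a replacement for it. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double nodes    = 640.0;     /* node count                      */
      double mem_node = 2.0e9;     /* ~2 GB per node                  */
      double fraction = 0.80;      /* usable fraction -- a guess      */
      long   nb       = 232;       /* block size -- also just a guess */

      double n = sqrt(fraction * nodes * mem_node / 8.0);
      long   N = (long)(n / nb) * nb;  /* round down to a multiple of NB */

      printf("suggested N ~ %ld\n", N);  /* about 357,000 here */
      return 0;
  }

If the number the spreadsheet gives you for 1280 cores is much bigger
than that, the OOM kills would not be a surprise.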
>
> Could there be some threshold in xhpl, where above some problem size
> its RAM usage increases for other reasons?
>
> What about the "PxQ" parameters? For 676 cores we are using a square
> P=Q grid, but we have to change this to use 1280 cores. Does anyone
> know of problems with running xhpl when P != Q on x86_64?
Have you tried running xhpl on both halves of the system? This will
tell you if you have hardware problems on one side of the system.
Also, try setting N to a small number, like 10000, for the entire
cluster; you can start to isolate the problem that way as well. Just
make sure that P < Q and keep the grid as close to square as possible;
32x40 should work well for your system.
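For reference, the relevant lines in HPL.dat would look something like
the excerpt below for the full machine. The Ns value shown is the small
smoke-test size; swap in the spreadsheet's number once the job survives
start-up on all 640 nodes, and use whatever NB you have tuned for your
BLAS:

  1            # of problem sizes (N)
  10000        Ns
  1            # of NBs
  232          NBs
  ...
  1            # of process grids (P x Q)
  32           Ps
  40           Qs

With Ps=32 and Qs=40 the grid covers all 1280 cores, and if a tiny N
gets through MPI_Init everywhere you can be fairly sure the OOM kills
are tied to the problem size rather than to xhpl start-up itself.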
Craig
>
> Cheers,
> Bruce Allen
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf