[Beowulf] hang-up of HPC Challenge

Mikhail Kuzminsky kus at free.net
Tue Aug 19 16:45:43 PDT 2008


To narrow down the possible cause of the problem, I ran the pure HPL 
test instead of HPCC. HPL writes its output directly to the screen 
instead of to a file.

Using MPICH with np=8, I obtained a normal HPL result for N=35000, 
including 3 "PASSED" strings for the ||Ax-b|| checks. BUT! Linux hangs 
immediately after these strings are output.
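
As a rough check on the sizes involved, the arithmetic behind the 
N values and the 32-bit integer remark in the quoted message can be 
sketched as below. This is only an illustration under stated 
assumptions, not a diagnosis: it assumes 8 bytes per matrix element, 
reads "16 Gbytes" as 16 GiB, ignores per-process workspace, and uses 
the total element count N*N as a conservative stand-in for whatever 
index the library actually computes. The names hpl_matrix_bytes and 
fits_int32 are hypothetical helpers.

```python
# Rough size check for an N x N double-precision HPL matrix.
# Assumptions (not from the original post): 8 bytes per element,
# "16 Gbytes" read as 16 GiB, per-process workspace ignored.

INT32_MAX = 2**31 - 1
RAM_BYTES = 16 * 2**30


def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed to store the N x N coefficient matrix in doubles."""
    return 8 * n * n


def fits_int32(n: int) -> bool:
    """True if the element count N*N fits in a signed 32-bit integer."""
    return n * n <= INT32_MAX


for n in (10000, 20000, 35000):
    size = hpl_matrix_bytes(n)
    print(f"N={n}: {size / 2**30:.2f} GiB, "
          f"{100 * size / RAM_BYTES:.0f}% of RAM, "
          f"N*N fits int32: {fits_int32(n)}")
```

Under these assumptions every N tried so far fits in RAM, and N*N 
stays within a signed 32-bit integer up to N=46340.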

Mikhail 
  

In message from "Mikhail Kuzminsky" <kus at free.net> (Mon, 18 Aug 2008 
22:20:16 +0400):
>I ran the set of HPC Challenge benchmarks on ONE dual-socket quad-core 
>Opteron 2350 (Rev. B3) based server (8 logical CPUs).
>RAM size is 16 Gbytes. The tests were performed under SuSE 
>10.3/x86-64, with LAM MPI 7.1.4 and MPICH 1.2.7 from the SuSE 
>distribution, using Atlas 3.9. Unfortunately there is only one such 
>cluster node, and I can't reproduce the run on another node :-(
>
>For N (matrix size) up to 10000 all looks OK. But for larger N 
>(15000/20000/...) hpcc execution (mpirun -np 8 hpcc) leads to a Linux 
>hang.
>
>In the "top" output I see 8 hpcc processes, each using about 100% of 
>CPU, with reasonable amounts of virtual and RSS memory per hpcc 
>process and no swap usage. Usually there are no PTRANS results in 
>the hpccoutf.txt results file, but in a few cases (when I was 
>actively watching the hpcc execution with ps/top) I saw reasonable 
>PTRANS results but no HPLinpack results. Once I obtained PTRANS, HPL 
>and DGEMM results for N=20000, but the hang came later, during the 
>STREAM tests. Maybe the results are missing simply because the 
>output buffer had not yet been flushed to the output file on disk 
>when the hang occurred.
>
>One possible reason for the hangs is a memory hardware problem, but 
>what about possible software causes? 
>The hpcc executable is 64-bit, dynamically linked. 
>/etc/security/limits.conf is empty. The stack size limit (for the user 
>issuing mpirun) is "unlimited", the main memory limit is about 14 GB, 
>and the virtual memory limit is about 30 GB. Atlas was compiled for 
>32-bit integers, but that is enough for such N values. Even 
>/proc/sys/kernel/shmmax is 2^63-1.
>
>What else could be causing the hang?
>
>Mikhail Kuzminskiy
>Computer Assistance to Chemical Research Center
>Zelinsky Institute of Organic Chemistry
>Moscow



