Clarification: [Beowulf] hpl - large problems fail

Paul Johnson redboots at ufl.edu
Thu Mar 10 14:56:16 PST 2005


Guy Coates wrote:

>On Thu, 10 Mar 2005, Paul Johnson wrote:
>
>>All:
>>
>>I have a 4 node cluster (don't snicker :) )
>
>Everyone starts off small.
>
>>and I'm trying to do some
>>benchmarking with HPL.  I want to test 2 of the nodes with 1 GB of
>>RAM each.  I calculated the maximum problem size that can fit in 2 GB
>>and still leave memory for the operating system.  That came out to
>>be around 14500x14500.  When I run a test of that size it always fails.
>>The largest problem I can test without it failing is 12500x12500.
>>What is the reason behind this?  I'm confused about what is going on here.
>>Thanks for any help.
>
>Do you know what actually caused the failure?
>
>If your problem size was too big and you really were out of memory, you
>should see messages in the system log saying the out-of-memory killer
>was activated and HPL was zapped.
>
>If you know your machines were not actually out of memory, then you have
>broken hardware on one of your nodes. Run memtest86+ or memtest86 on your
>nodes (possibly the world's most useful pieces of diagnostic software).
>
>http://www.memtest86.com
>http://www.memtest.org
>
>
>If you haven't seen it, IBM have a redpaper on tuning HPL, which gives
>some good starting parameters, problem-sizing tips and an overview of
>the different BLAS libraries you can compile against to get those extra
>few Gflops of performance.
>
>Cheers,
>
>Guy
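
For reference, the sizing quoted above comes from the fact that HPL
factors an N x N matrix of 8-byte doubles, so the matrix alone needs
8*N*N bytes; picking N so that the matrix uses most, but not all, of
the combined RAM gives roughly N = sqrt(0.8 * 2 GB / 8) ~ 14650.  A
minimal sketch of that arithmetic (the 80% headroom figure is an
assumption, not something fixed by HPL):

/* Back-of-the-envelope HPL problem sizing (a sketch, not from the thread).
 * The N x N matrix of 8-byte doubles needs 8*N*N bytes; leaving ~20% of
 * RAM for the OS and MPI buffers is an assumed margin. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double total_ram = 2.0 * 1024 * 1024 * 1024;  /* 2 x 1 GB nodes        */
    double fraction  = 0.80;                      /* assumed share for HPL */
    double n = sqrt(fraction * total_ram / 8.0);  /* 8 bytes per double    */
    printf("suggested N ~= %.0f\n", n);           /* ~14650 here           */
    return 0;
}
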
I should have been clearer in my description.  HPL doesn't fail at the
command prompt when I run it; it fails when it checks the solution to the
linear system.  The residual is too high, so the check fails.  This is
part of the data from my HPL.out file:

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WC12R2L4       14500    64     1     2             388.43          5.233e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =   284363.4669186 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =   210262.3627204 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =    41377.6398965 ...... FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . =           0.001692
||A||_oo . . . . . . . . . . . . . . . . . . . =        3708.772315
||A||_1  . . . . . . . . . . . . . . . . . . . =        3695.221759
||x||_oo . . . . . . . . . . . . . . . . . . . =           6.847285
||x||_1  . . . . . . . . . . . . . . . . . . . =       19610.120504
============================================================================
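
For reference, HPL marks a run PASSED only when each of those scaled
residuals stays below the threshold given in HPL.dat (16.0 in the stock
input file); values of order 1e4-1e5 mean the computed solution is
numerically wrong rather than a symptom of running out of memory, which
is why the advice above points at broken hardware rather than memory
pressure.  A minimal sketch of that comparison, plugging in the norms
from the excerpt above (the variable names and the eps constant are
illustrative, not HPL's own code):

/* Sketch of the PASS/FAIL test behind the output above (illustrative;
 * not HPL's actual source).  Each scaled residual must stay below the
 * threshold from HPL.dat (16.0 in the stock input file). */
#include <float.h>
#include <stdio.h>

int main(void)
{
    const double threshold = 16.0;         /* "threshold" line in HPL.dat  */
    const double eps = DBL_EPSILON / 2.0;  /* relative machine precision,
                                              ~1.1e-16                     */
    const double N = 14500.0;
    /* Norms taken from the HPL.out excerpt above. */
    const double r_oo = 0.001692;          /* ||Ax-b||_oo                  */
    const double A_oo = 3708.772315, A_1 = 3695.221759;
    const double x_oo = 6.847285,    x_1 = 19610.120504;

    double check[3] = {
        r_oo / (eps * A_1  * N),           /* ~2.8e5 in the run above      */
        r_oo / (eps * A_1  * x_1),         /* ~2.1e5                       */
        r_oo / (eps * A_oo * x_oo * N),    /* ~4.1e4 (this one is scaled
                                              by N as well)                */
    };
    for (int i = 0; i < 3; i++)
        printf("%e ...... %s\n", check[i],
               check[i] < threshold ? "PASSED" : "FAILED");
    return 0;
}
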

Sorry for the confusion,
Paul

-- 
Paul Johnson
Graduate Student - Mechanical Engineering
University of Florida - Gainesville, Fl
http://plaza.ufl.edu/redboots

Reclaim Your Inbox!
http://www.mozilla.org/products/thunderbird
