[Beowulf] hpl - large problems fail

Guy Coates gmpc at sanger.ac.uk
Thu Mar 10 14:11:37 PST 2005


On Thu, 10 Mar 2005, Paul Johnson wrote:

> All:
>
> I have a 4-node cluster (don't snicker :) )

Everyone starts off small.

> and I'm trying to do some
> benchmarking with HPL.  I want to test 2 of the nodes, with 1 GB of
> RAM each.  I calculated the maximum problem size that can fit in 2 GB
> and still leave memory for the operating system.  That came out to
> around 14500x14500.  When I run a test of that size, it always fails.
> The largest problem I can run without it failing on me is
> 12500x12500.
> What is the reason behind this?  I'm confused about what is going on here.
> Thanks for any help.


Do you know what actually caused the failure?

If your problem size was too big and you really ran out of memory, you
should see messages in the system log saying the out-of-memory (OOM)
killer was activated and HPL was zapped.
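
For what it's worth, the usual rule of thumb is to size N so that the
matrix fills about 80% of total memory, since the N x N array of
doubles (N^2 * 8 bytes) dominates everything else. Here is a quick
sketch of the arithmetic; the 0.80 fraction and NB=128 block size are
assumptions on my part, not values from your setup:

/* hpl_size.c -- back-of-the-envelope HPL problem-size estimate.
 * A sketch only: the 0.80 memory fraction and NB=128 are assumed.
 * Build with: cc hpl_size.c -o hpl_size -lm                       */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double mem_bytes = 2.0 * 1024 * 1024 * 1024; /* 2 x 1 GB nodes */
    double fraction  = 0.80; /* leave ~20% for OS, MPI, HPL workspace */
    long   nb        = 128;  /* HPL block size; tune for your BLAS */

    /* The N x N matrix of doubles dominates memory use:
     * N^2 * 8 bytes  =>  N = sqrt(fraction * mem / 8).  */
    long n = (long)sqrt(fraction * mem_bytes / 8.0);

    n -= n % nb; /* round N down to a multiple of NB */

    printf("suggested N = %ld\n", n); /* prints 14592 for these inputs */
    return 0;
}

That gives N of roughly 14600 for 2 GB, so your 14500 leaves only a
few hundred MB across both nodes for the OS, MPI buffers and HPL's own
workspace, which may be why it tips over where 12500 does not.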

If you know your machines were not actually out of memory, then you have
broken hardware on one of your nodes. Run memtest86 or memtest86+ on your
nodes (possibly the world's most useful pieces of diagnostic software).

http://www.memtest86.com
http://www.memtest.org


If you haven't seen it, IBM have a Redpaper on tuning HPL, which gives
some good starting parameters, problem-sizing tips, and an overview of
the different BLAS libraries you can compile against to get an extra
few Gflops of performance.
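
If it helps, the lines in HPL.dat that control the problem size and
process grid would look something like this for a single run on a 1x2
grid (one MPI process per node); the N and NB values here are
illustrative, not recommendations:

1            # of problems sizes (N)
14592        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs

The remaining lines (panel factorisation, broadcast algorithm,
lookahead and so on) can stay at the template defaults until the basic
run is stable.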

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199



