Clarification: [Beowulf] hpl - large problems fail

Guy Coates gmpc at sanger.ac.uk
Fri Mar 11 06:46:24 PST 2005


> the command prompt when I run it.  It fails when it checks the solution
> to linear equations.  The residual is too high and fails.  This is part
> of the data from my HPL.out file:
>

This could still be dodgy memory; if bits get flipped then you can expect
those sorts of numerical instabilities.
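
To illustrate how much damage one flipped bit can do to a double, here is a minimal sketch. It is not taken from HPL; the helper name flip_bit is made up for this example, and it only assumes IEEE 754 doubles, which is what Python floats are on mainstream hardware.

```python
import math
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE 754 encoding flipped.

    bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    (Hypothetical helper, for illustration only.)
    """
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    # Flip the requested bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

low = flip_bit(1.0, 0)    # lowest mantissa bit: off by 2**-52
high = flip_bit(1.0, 62)  # top exponent bit: 1.0 turns into infinity

print(low - 1.0, high, math.isinf(high))
```

A flip low in the mantissa is lost in normal rounding noise, but a flip in the exponent changes the value by many orders of magnitude, which is the kind of error that a residual check is likely to catch.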

Try running a single HPL job on each machine. If you get the correct
answer on 3 machines and the wrong answer on one, then you've narrowed it
down to hardware.
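
The triage above can be sketched as a small script. This is only a sketch under assumptions: it assumes you have already collected the text of each machine's HPL.out, that each residual check line ends in a run of dots followed by PASSED or FAILED, and the function and node names are made up for the example.

```python
import re

# Assumption: residual check lines in HPL.out end with "...... PASSED"
# or "...... FAILED". Adjust the pattern if your output differs.
VERDICT = re.compile(r"\.{3,}\s*(PASSED|FAILED)\s*$", re.MULTILINE)

def residual_verdicts(hpl_out: str) -> list:
    """Return the PASSED/FAILED verdicts found in one HPL.out text."""
    return VERDICT.findall(hpl_out)

def triage(outputs: dict) -> tuple:
    """Map {node name: HPL.out text} to (failing nodes, suspected cause).

    A run with no verdict lines at all is counted as failing, since the
    job probably died before the residual check.
    """
    bad = sorted(
        node
        for node, text in outputs.items()
        if not residual_verdicts(text) or "FAILED" in residual_verdicts(text)
    )
    if not bad:
        return bad, "none"
    if len(bad) == len(outputs):
        # Every machine is wrong: more likely compiler, BLAS or kernel.
        return bad, "software"
    # Only some machines are wrong: suspect those boxes.
    return bad, "hardware"

# Hypothetical single-node results, one HPL.out text per machine.
good = "||Ax-b||_oo / ( eps * ||A||_1  * N ) = 0.0051555 ...... PASSED\n"
fail = "||Ax-b||_oo / ( eps * ||A||_1  * N ) = 812345.1 ...... FAILED\n"
nodes = {"node1": good, "node2": good, "node3": good, "node4": fail}
print(triage(nodes))
```

The "software" and "hardware" labels are only a first guess; as the rest of this mail shows, a wrong answer on every machine can still come from something below the compiler.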

If you get the wrong answer on all your machines then you probably have a
software problem. Try recompiling HPL with compiler optimisations turned
off, or with a different compiler and/or BLAS library.


If that doesn't work, then it might just be possible that you are into
weird hardware/kernel bug territory.  I ran into similar HPL problems
whilst benchmarking a rather large hardware purchase we made several years
ago. The HPL residuals were coming out as NaN.  Recompiling with a
different compiler gave the same result. Rather worryingly, the same
binaries ran correctly when run on different hardware. After a lot of head
scratching and phone calls to an extremely worried vendor ("Hey, this kit
you sold us can't do maths properly!") the problem was tracked down to a
dodgy kernel module. It turned out that the module provided by the vendor
to do console-over-lan stomped over the floating point registers under
certain circumstances.

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199




