Clarification: [Beowulf] hpl - large problems fail
Craig Tierney
ctierney at HPTI.com
Fri Mar 11 07:06:21 PST 2005
On Fri, 2005-03-11 at 07:46, Guy Coates wrote:
> > the command prompt when I run it. It fails when it checks the solution
> > to linear equations. The residual is too high and fails. This is part
> > of the data from my HPL.out file:
> >
>
> This could still be dodgy memory; if bits get flipped then you can expect
> those sorts of numerical instabilities.
>
> Try running a single HPL job on each machine. If you get the correct
> answer on 3 machines and the wrong answer on one, then you've narrowed it
> down to hardware.
>
> If you get the wrong answer on all your machines then you probably have a
> software problem. Try recompiling HPL with no compiler optimisations, a
> different compiler and/or BLAS library.
>
>
> If that doesn't work, then it might just be possible that you are into
> weird hardware/kernel bug territory. I ran into similar HPL problems
> whilst benchmarking a rather large hardware purchase we made several years
> ago. The HPL residuals were coming out as NaN. Recompiling with a
> different compiler gave the same result. Rather worryingly, the same
> binaries ran correctly when run on different hardware. After a lot of head
> scratching and phone calls to an extremely worried vendor ("Hey, this kit
> you sold us can't do maths properly!") the problem was tracked down to a
> dodgy kernel module. It turned out that the module provided by the vendor
> to do console-over-lan stomped over the floating point registers under
> certain circumstances.
>
It could also be the interconnect. With Ethernet I would think that is
unlikely, but I have seen high-speed interconnects with a problem in the
PCI slot give wrong answers when running HPL on more than 2 systems.
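
The per-node test Guy describes can be turned into a small triage script. What follows is only a rough sketch in Python, not anything either of us actually ran. It assumes you have collected one HPL.out per host from single-node runs, and that each residual check line in HPL.out ends in a run of dots followed by PASSED or FAILED. The function names and hostnames are made up for illustration.

```python
import re

# Assumption: a residual check line in HPL.out ends with dots and a verdict,
# e.g. "... =        0.0050838 ...... PASSED". Adjust if your output differs.
STATUS = re.compile(r"\.{3,}\s*(PASSED|FAILED)\s*$")


def count_checks(hpl_out_text):
    """Return (passed, failed) counts of residual checks in one HPL.out."""
    passed = failed = 0
    for line in hpl_out_text.splitlines():
        match = STATUS.search(line)
        if match is None:
            continue
        if match.group(1) == "PASSED":
            passed += 1
        else:
            failed += 1
    return passed, failed


def node_is_bad(hpl_out_text):
    """A node is bad if any check failed or no check was reported at all."""
    passed, failed = count_checks(hpl_out_text)
    return failed > 0 or passed == 0


def triage(results):
    """results maps hostname -> HPL.out text from a single-node run."""
    bad = sorted(host for host, text in results.items() if node_is_bad(text))
    if not bad:
        return "all nodes pass alone: suspect interconnect or multi-node setup"
    if len(bad) == len(results):
        return "all nodes fail: suspect software (compiler, BLAS)"
    return "suspect hardware on: " + ", ".join(bad)
```

If every node passes on its own but multi-node runs still fail, the same idea can be repeated on pairs of nodes to narrow down a bad interconnect card or slot.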
Craig