[Beowulf] OOM errors when running HPL

Mon Dec 22 05:52:44 PST 2008

Alan Louis Scheinine wrote:
> A year ago large memory jobs would cause AMD nodes to crash
> on the cluster for which I was system administrator.
> /var/log/messages showed out of memory errors before the crash.
> I can't say that the problem has been solved; I refer to last
> year because I changed jobs.
> 
> In order to understand if the problem is a known bug (as in the
> case cited above) please specify the main board, the amount of
> memory, the number of cores and the version of the kernel.
> 
> You wrote:
>> I used to run hpl jobs much bigger than this on my cluster w/o a
>> problem.
> 
> How does the amount of memory on the new cluster compare to the cluster
> on which you did not have a problem?  In particular, how does the amount
> of memory per core compare, assuming all cores were used in your testing?

Alan, thanks for the reply. It's the same cluster: jobs that ran on it
a few weeks ago no longer run. There have been no hardware changes, so
I don't think it's a hardware problem. The only difference I can think
of is that I'm now using SGE to launch these jobs, which I may not have
been doing the last time I ran a job this big.

The only other possible software changes are kernel package updates that
may have occurred since the last successful run of a job this big.

-- 
Prentice