[Beowulf] OOM errors when running HPL
prentice at ias.edu
Mon Dec 22 14:12:39 PST 2008
Skylar Thompson wrote:
> Prentice Bisbal wrote:
>> I've got a new problem with my cluster. Some of this problem may be with
>> my queuing system (SGE), but I figured I'd post here first.
>> I've been using hpl to test my new cluster. I generally run a small
>> problem size (Ns=60000) so the job only runs 15-20 minutes. Last night, I
>> upped the problem size by a factor of 10 (to Ns=600000). Shortly after
>> submitting the job, half the nodes were shown as down in Ganglia.
>> I killed the job with qdel, and the majority of the nodes came back, but
>> about 1/3 did not. When I came in this morning, there were kernel
>> panic/OOM type messages on the consoles of the systems that never came
>> back.
>> I used to run hpl jobs much bigger than this on my cluster w/o a
>> problem. There's nothing I actively changed, but there might have been
>> some updates to the OS (kernel, libs, etc) since the last time I ran a
>> job this big. Any ideas where I should begin looking?
> I've run into similar problems, and traced it to the way Linux
> overcommits RAM. What are your vm.overcommit_memory and
> vm.overcommit_ratio sysctls set to, and how much swap and RAM do the
> nodes have?
I found the problem - it was me. I never ran HPL problems with Ns=600k.
The largest job I ran was ~320k. I figured this out after checking my
notes. Sorry for the trouble.
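
As a rough back-of-the-envelope check of why the jump in Ns matters: HPL
factors a dense Ns x Ns matrix of 8-byte doubles, so the matrix alone needs
about 8 * Ns^2 bytes summed over all nodes, and a 10x increase in Ns means a
100x increase in memory. A minimal Python sketch follows; the function name is
made up for illustration, and the cluster's actual RAM is not stated in this
thread.

```python
def hpl_matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Approximate memory for the dense N x N HPL matrix (double precision)."""
    return n * n * bytes_per_element

# Problem sizes mentioned in this thread
for ns in (60_000, 320_000, 600_000):
    gib = hpl_matrix_bytes(ns) / 2**30
    print(f"Ns={ns:>7}: ~{gib:,.1f} GiB total across all nodes")
```

In decimal units that is roughly 28.8 GB for Ns=60000, 819 GB for Ns=320000,
and 2.88 TB for Ns=600000, before any workspace HPL or MPI needs on top.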
However, I did want to configure my systems so that they handle requests
for more memory more gracefully, so I added this to my sysctl.conf file
(Thanks for the reminder, Skylar!)
I am actually using this on many of my other computational servers to
prevent OOM crashes, but forgot to add this to my cluster nodes.
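
The exact sysctl.conf lines are not reproduced in this message. Purely as an
illustration of the two sysctls Skylar mentions: when vm.overcommit_memory=2,
the kernel refuses allocations beyond a commit limit of roughly
swap + RAM * vm.overcommit_ratio / 100 (ignoring huge pages). The sketch below
computes that limit; the function name and the example node are hypothetical,
not values from this cluster.

```python
def commit_limit_kb(ram_kb: int, swap_kb: int, overcommit_ratio: int) -> int:
    """Approximate CommitLimit under vm.overcommit_memory=2 (huge pages ignored)."""
    return swap_kb + ram_kb * overcommit_ratio // 100

# Hypothetical node: 16 GiB RAM, 4 GiB swap, vm.overcommit_ratio=50 (the kernel default)
limit = commit_limit_kb(16 * 1024**2, 4 * 1024**2, 50)
print(f"CommitLimit ~ {limit / 1024**2:.1f} GiB")
```

On a running Linux node, the result can be compared against the CommitLimit
line in /proc/meminfo.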
Thanks to everyone for the replies.