[Beowulf] Setting memory limits on a compute node

Chris Samuel csamuel at vpac.org
Wed Jun 9 18:23:25 PDT 2004


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 9 Jun 2004 01:42 am, Brent M. Clements wrote:

> It appears that the Gaussian application is exhausting
> all of the memory in the system, essentially stopping the machine from
> working. You can still ping the machine but can't ssh. Anyway, I know the
> fundamentals of why this is happening.
[...]
>
> What is the best way to approach this kind of issue? We have come up with a
> few solutions but each one has its drawbacks.

We've had this problem (not with Gaussian) and the best we could do was dump 
the kernels for our distro (RH7.3) and go straight to 2.4.26 and make sure 
the OOM killer was disabled.

Basically this seems to be the old OOM killer deadlock problem which is fixed 
in more recent kernels.

It's not perfect (it will sometimes kill other processes that try and fail to
malloc() before it gets to the real culprit) but it does stop the node
grinding completely into the dirt. If sshd has been killed off, we can then
use rconsole (from CSM) to drop onto that node and restart sshd or reboot,
without having to go and do the hard power-cycle that we used to need.
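
A complementary, per-job measure (not part of the kernel change described
above) is to cap a job's address space so that its own malloc() calls fail
before the node runs out of memory. The following is only a minimal sketch,
assuming a Linux node with Python 3 available; the helper name
run_with_memory_cap is hypothetical, and a batch system's own limit settings
may be a better place to enforce this.

```python
import resource
import subprocess


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its virtual address space capped at max_bytes.

    Returns the exit code of the command. The helper name and interface
    are illustrative assumptions, not part of any cluster toolkit.
    """
    def apply_limit():
        # Runs in the child between fork() and exec(), so only the job
        # (and anything it spawns) inherits the limit, not the caller.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=apply_limit).returncode
```

For example, run_with_memory_cap(["./my_job"], 2 * 1024**3) would start a
hypothetical ./my_job with a 2 GiB cap. Note that RLIMIT_AS limits virtual
address space rather than resident memory, so the cap needs some headroom
over what the job really uses.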

- -- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAx7gNO2KABBYQAh8RAl7NAJ9WQIz7CWWiFD6IsuViTc9elRn4gACdEmSU
ryNe/mdZ9SUFO4XdjRQGFGk=
=VvWO
-----END PGP SIGNATURE-----



