[Beowulf] Setting memory limits on a compute node
Chris Samuel
csamuel at vpac.org
Wed Jun 9 18:23:25 PDT 2004
On Wed, 9 Jun 2004 01:42 am, Brent M. Clements wrote:
> It appears that the Gaussian application is exhausting
> all of the memory in the system, essentially stopping the machine from
> working. You can still ping the machine but can't ssh. Anyway, I know the
> fundamentals of why this is happening.
[...]
>
> What is the best way to approach this kind of issue? We have come up with a
> few solutions but each one has its drawbacks.
We've had this problem (not with Gaussian), and the best we could do was drop
the stock kernels for our distro (RH7.3), go straight to 2.4.26, and make sure
the OOM killer was disabled.
Basically, this seems to be the old OOM killer deadlock problem, which is fixed
in more recent kernels.
It's not perfect (it sometimes kills other processes that try, and fail, to
malloc() before it gets to the real culprit), but it does stop the node from
grinding completely into the dirt. If sshd has been killed off, we can then use
rconsole (from CSM) to get onto that node and restart sshd or reboot the node,
without having to go and do the hard power-cycle that we used to.
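The kernel fix above protects the node as a whole. A complementary per-job cap is also possible, and the following C sketch shows one way to do it under stated assumptions: a hypothetical launcher, `run_with_memory_limit()`, forks, lowers `RLIMIT_AS` in the child with `setrlimit()` (the same limit the shell's `ulimit -v` sets), and then `execvp()`s the job. Once a capped job reaches the limit, its own malloc() calls return NULL instead of the whole node running out of memory. The function name and the limit value are assumptions to tune per node; `RLIMIT_AS` counts virtual address space rather than memory actually touched, so leave headroom.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from any existing tool): run argv[0] with its
 * virtual address space capped at limit_bytes.
 * Returns the child's exit status, 126 if setrlimit() failed in the child,
 * 127 if exec failed, 128+signal if the child was killed by a signal,
 * or -1 if fork()/waitpid() failed. */
int run_with_memory_limit(rlim_t limit_bytes, char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (pid == 0) {
        /* Child: lower both the soft and the hard limit so the job
         * cannot raise it again before or after exec. */
        struct rlimit rl = { .rlim_cur = limit_bytes, .rlim_max = limit_bytes };
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            _exit(126);
        }
        execvp(argv[0], argv);
        perror("execvp"); /* only reached if exec failed */
        _exit(127);
    }

    /* Parent: wait for the job, retrying if interrupted by a signal. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

For example, `char *job[] = { "/path/to/job", "input", NULL }; run_with_memory_limit((rlim_t)1 << 30, job);` would start a (hypothetical) job with a 1 GiB address-space cap. Applications that don't check malloc()'s return value will typically crash at that point, but only that job dies, not the node.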
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia