[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?
David Mathog
mathog at caltech.edu
Fri Jun 8 09:44:53 PDT 2018
This isn't quite the same issue, but several times I have observed a
large multi-CPU machine lock up because the accounting records associated
with a zillion tiny, rapidly launched jobs made an enormous
/var/account/pacct file and filled the small root filesystem. Actually,
it usually wasn't pacct itself that brought the system to its knees but
the cron-scheduled gzip of that file, which applied the coup de grace.
That left the original big pacct and a very large partial pacct-$DATE.gz,
which used up the last few free bytes.
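
The failure described above comes down to gzip writing its output onto the same filesystem while the original file is still there. A minimal sketch of a check that could run before the compression step, assuming Python is available on the node. The function name and the margin parameter are hypothetical, and this is not part of any cron script that ships with the accounting package:

```python
import os
import shutil


def safe_to_compress(path, margin_bytes=0):
    """Return True if the filesystem holding `path` has room for a copy.

    Assumption: in the worst case the compressed output is roughly as
    large as the input, so we require free space >= file size plus an
    optional safety margin (hypothetical parameter).
    """
    size = os.path.getsize(path)
    # Measure free space on the filesystem containing the file's directory,
    # since the compressed copy is written alongside the original.
    directory = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(directory).free
    return free >= size + margin_bytes
```

A rotation job could call safe_to_compress("/var/account/pacct") and skip the gzip step when it returns False, leaving the uncompressed file in place rather than filling the disk with a partial archive.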
As far as I know there is no way to selectively disable saving process
accounting records for "all children of process PID". Accounting is
either on or off. Now, when I run scripts prone to this, accounting is
turned off first.
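
One way to make the "turn it off first" step harder to forget is a small wrapper that disables accounting, runs the job, and re-enables accounting even if the job fails. This is only a sketch: the post does not say which command was used, so the "service psacct stop/start" defaults below are an assumption about how accounting is managed on a given host and should be replaced with whatever the site actually uses. It needs root, and it pauses accounting for the whole machine, not just the job.

```python
import subprocess
from contextlib import contextmanager

# Assumption: accounting is controlled by an init script named "psacct".
# Substitute the commands appropriate for the local system.
OFF_CMD = ("service", "psacct", "stop")
ON_CMD = ("service", "psacct", "start")


@contextmanager
def accounting_paused(off_cmd=OFF_CMD, on_cmd=ON_CMD,
                      run=subprocess.check_call):
    """Turn process accounting off for the duration of a `with` block.

    `run` is injectable so the wrapper can be exercised without root.
    """
    run(off_cmd)
    try:
        yield
    finally:
        # Re-enable accounting even if the wrapped job raised an error.
        run(on_cmd)


# Usage (the script name is hypothetical):
#
#     with accounting_paused():
#         subprocess.check_call(["./many_tiny_jobs.sh"])
```

The try/finally is the point of the wrapper: a job that dies partway through does not leave accounting disabled on the machine.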
This was on CentOS 6.9, on machines reporting (via /proc/cpuinfo) 48 and
56 CPUs.
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech