[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

David Mathog mathog at caltech.edu
Fri Jun 8 09:44:53 PDT 2018


This isn't quite the same issue, but several times I have observed a 
large multi-CPU machine lock up because the accounting records associated 
with a zillion tiny, rapidly launched jobs made an enormous 
/var/account/pacct file and filled the small root filesystem.  Actually, 
it usually wasn't pacct itself that brought the system to its knees but 
the cron-scheduled gzip of that file, which delivered the coup de grace.  
That left the original big pacct plus a very large partial pacct-$DATE.gz, 
which used up the last few free bytes.
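
The failure mode above (a compressed copy being written next to a file 
that already nearly fills the filesystem) can be checked for ahead of 
time.  What follows is only a rough sketch, not what I actually ran: the 
function names and the worst_case_ratio parameter are made up for 
illustration, and the default path is simply the one mentioned above.

```python
import os
import shutil


def pacct_headroom(path="/var/account/pacct"):
    """Return (size_of_file, free_bytes_on_its_filesystem)."""
    size = os.path.getsize(path)
    # Ask about the directory so the answer is for the filesystem that
    # will also receive the compressed copy.
    free = shutil.disk_usage(os.path.dirname(path) or ".").free
    return size, free


def rotation_is_risky(size, free, worst_case_ratio=1.0):
    """True if a compressed copy of up to size * worst_case_ratio bytes
    might not fit in the remaining free space.

    worst_case_ratio is an assumed, deliberately pessimistic bound on
    compressed size / original size; 1.0 means "assume no compression".
    """
    return size * worst_case_ratio >= free


if __name__ == "__main__":
    size, free = pacct_headroom()
    if rotation_is_risky(size, free):
        print("pacct is %d bytes but only %d bytes are free" % (size, free))
```

Run from cron ahead of the rotation job, something like this would at 
least give a warning before gzip starts writing.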

As far as I know there is no way to selectively disable saving process 
accounting records for "all children of process PID".  Accounting is 
either on or off.  Now, when I run scripts prone to this, I turn 
accounting off first.
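
One way to make the "turn it off first" step harder to forget is to wrap 
it around the job-launching code.  This is a minimal sketch under stated 
assumptions: accounting_paused is a hypothetical helper, and the two 
commands are passed in by the caller rather than hard-coded because the 
accton syntax for switching accounting off differs between acct/psacct 
versions (check accton(8) on the machine in question).

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def accounting_paused(off_cmd, on_cmd):
    """Run off_cmd, yield to the caller, then always run on_cmd.

    off_cmd and on_cmd are argument lists for subprocess.run, e.g. the
    site's own accton invocations.  They are assumptions supplied by the
    caller; nothing here knows about accton itself.
    """
    subprocess.run(off_cmd, check=True)
    try:
        yield
    finally:
        # Re-enable accounting even if the wrapped code raised.
        subprocess.run(on_cmd, check=True)
```

Since accounting is system-wide, nothing from any user is recorded 
while the block runs, and the commands will normally need root.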

This was on CentOS 6.9, on machines reporting (via /proc/cpuinfo) 48 and 
56 CPUs.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

