[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?
Chris Samuel
chris at csamuel.org
Mon Jun 11 21:28:25 PDT 2018
On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote:
> Unfortunately we don't have a mechanism to limit
> network usage or local scratch usage
Our trick in Slurm is to use the slurmd prolog script to set an XFS project
quota, keyed on the job ID, for the per-job directory (created by a plugin
which also makes subdirectories there that it maps to /tmp and /var/tmp for
the job) on the XFS partition used for local scratch on the node.
If the user doesn't request an amount via the --tmp= option then they get a
default of 100MB. Snipping the relevant segments out of our prolog...
JOBSCRATCH=/jobfs/local/slurm/${SLURM_JOB_ID}.${SLURM_RESTART_COUNT}

if [ -d "${JOBSCRATCH}" ]; then
    QUOTA=$(/apps/slurm/latest/bin/scontrol show JobId=${SLURM_JOB_ID} | egrep 'MinTmpDiskNode=[0-9]' | awk -F= '{print $NF}')
    if [ "${QUOTA}" == "0" ]; then
        QUOTA=100M
    fi
    /usr/sbin/xfs_quota -x -c "project -s -p ${JOBSCRATCH} ${SLURM_JOB_ID}" /jobfs/local
    /usr/sbin/xfs_quota -x -c "limit -p bhard=${QUOTA} ${SLURM_JOB_ID}" /jobfs/local
fi
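For anyone adapting this, the quota-extraction pipeline can be exercised on its
own without a live scheduler. Here's a sketch against a captured line of
scontrol output (the sample LINE below is an assumed format, not real job
data), showing how the 100MB fallback kicks in when no --tmp= was requested:

```shell
#!/bin/sh
# Sample fragment of `scontrol show JobId=...` output (assumed format);
# on a real system this would come from the scontrol pipeline in the prolog.
LINE='MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0'

# Same extraction idea as the prolog: isolate the MinTmpDiskNode field
# and keep everything after the '='.
QUOTA=$(echo "$LINE" | grep -Eo 'MinTmpDiskNode=[^ ]+' | awk -F= '{print $NF}')

# Jobs that never asked for --tmp= report 0, so apply the default.
if [ "$QUOTA" = "0" ]; then
    QUOTA=100M
fi

echo "$QUOTA"   # prints 100M for this sample line
```

The same two xfs_quota calls as in the prolog would then consume ${QUOTA};
the only scheduler-specific part is where the MinTmpDiskNode value comes from.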
Hope that is useful!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC