[Beowulf] head node abuse
Chris Dagdigian
dag at sonsorol.org
Fri Mar 26 14:25:32 UTC 2021
Honest advice ... aka my personal $.02 ...
This is a problem that can't be solved entirely by technical means like
resource constraints or cgroup controls. It is more of a training,
knowledge-transfer and acceptable-use-policy issue, and any real fix has
to include those elements.
What I've learned over many years is that end-users looking to game the
system will always have more time and more motivation to find evasive
methods than IT and sysadmins have to catch and close the loopholes.
I tend to recommend making "head node abuse" an employee behavior /
management issue, and I do only the bare minimum resource fencing on the
head nodes and submission nodes to keep them from being run into the
ground.
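For concreteness, the "bare minimum" fencing I have in mind is often
nothing more than pam_limits ulimits on the head/submit nodes. A rough
sketch, with illustrative numbers rather than recommendations:

    # /etc/security/limits.d/90-headnode.conf  (example values only)
    # Cap per-process address space at ~8 GB (value is in KB)
    *    hard    as       8388608
    # Cap the number of processes per user
    *    hard    nproc    256
    # Cap per-process CPU time at 60 minutes
    *    hard    cpu      60

That's enough to keep a single runaway process from taking the node
down, without pretending to be abuse-proof.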
The process works like this:
- If you want to use the cluster, you either take a short training
course or, if you are experienced, you read and sign our HPC acceptable
use policy, which clearly explains what you can and cannot do on the
head, submit and login nodes. We also point you to all of our
documentation and training resources.
- The first 1-2 times you are "caught" abusing the head node, we treat
it as a simple training and knowledge-transfer opportunity: no real
repercussions, and a good opportunity for IT to reach out and work 1:1
with the end user to learn her/his requirements and workflow interests.
99% of the time the head node abuse stops here.
- The third time you are caught abusing the head node, your login access
is terminated until you review the acceptable use policy and return a
documented acknowledgement. Your manager is CC'd on these emails, but
there are no other repercussions.
- The fourth time you are caught, we treat it as a non-trivial violation
of organizational policies. HR is notified along with your management
chain, and your cluster access is terminated until some sort of process
and plan is worked through with HR and the user's manager.
> Michael Di Domenico <mdidomenico4 at gmail.com>
> March 26, 2021 at 9:56 AM
> Does anyone have a recipe for limiting the damage people can do on
> login nodes on RHEL7? I want to limit the allocatable cpu/mem per
> user to some low value, so that if someone kicks off a program but
> forgets to 'srun' it first, they get bound to a single core and don't
> bump anyone else.
>
> I've been poking around the net, but I can't find a solution, I don't
> understand what's being recommended, and/or I'm implementing the
> suggestions wrong; I haven't been able to get them working. The most
> succinct answer I found is that per-user cgroup controls were
> implemented in systemd v239/240, but since RHEL7 is still on v219
> that's not going to help. I also found some wonkiness that runs a
> program after a user logs in and hacks at the cgroup files directly,
> but I couldn't get that to work.
>
> Supposedly you can override the user-{UID}.slice unit file and jam in
> the cgroup restrictions, but I have hundreds of users, so clearly
> that's not maintainable.
>
> I'm sure others have already been down this road. Any suggestions?
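For what it's worth, the usual RHEL7-era workaround for the v219
limitation is to clamp the per-user slice at login with "systemctl
set-property" from a pam_exec hook, rather than maintaining per-UID unit
files. A rough, untested sketch; the script path and the
CPUQuota/MemoryLimit values are made-up examples, not recommendations:

    # /etc/pam.d/sshd -- add a session hook (illustrative); it must
    # come after pam_systemd so the user's slice already exists
    session    optional    pam_exec.so /usr/local/sbin/limit-user-slice.sh

    #!/bin/bash
    # /usr/local/sbin/limit-user-slice.sh  (hypothetical helper)
    # Runs from pam_exec when a session opens; clamps the logging-in
    # user's slice so a runaway process can't take over the login node.
    [ "$PAM_TYPE" = "open_session" ] || exit 0
    uid=$(id -u "$PAM_USER") || exit 0
    [ "$uid" -ge 1000 ] || exit 0   # leave root/system accounts alone
    # --runtime applies to the live cgroup only, so there is no
    # per-user unit file to maintain for hundreds of users
    systemctl --runtime set-property "user-${uid}.slice" \
        CPUQuota=100% MemoryLimit=8G   # ~1 core, 8 GB -- example values
    exit 0

As far as I know, CPUQuota= and MemoryLimit= are accepted by the systemd
v219 shipped with RHEL7, and since user-{UID}.slice only exists while
the user has an active session, the properties have to be set at login
rather than at boot.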