[Beowulf] [External] head node abuse

Adam DeConinck ajdecon at ajdecon.org
Fri Mar 26 14:41:13 UTC 2021


I agree with Chris D that this is more of a human problem than a technical
problem. I have actually had a lot of success with user education -- people
don't often think about the implications of having lots of people logged
into the same head node, but get the idea when you explain it. Especially
when you explain it along the lines of, "if we let all these other people
test their MPI jobs on the head node, it would slow down YOUR work!"

Granted, people don't tend to read that explanation in the onboarding doc,
and I often have to re-explain it when it comes up in practice. ;-) But in
general I rarely see "repeat offenders", and when it happens removing
access is the right policy.

We do ALSO enforce some per-user limits with cgroups (auto-generating the
user-{UID}.slice as part of the user onboarding process). But in practice
this mostly protects against accidental abuse ("whoops, I launched mpirun
in the wrong terminal!"). The rare people who intentionally misuse the head
node will find work-arounds.
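For concreteness, here's a minimal sketch of the kind of drop-in our onboarding step generates. The UID and the limit values are made up, and it assumes a systemd new enough to honor user-{UID}.slice drop-ins (the v239/240 mentioned below; RHEL7's v219 won't pick these up):

```shell
# Sketch: generate a per-user slice drop-in capping CPU and memory.
# uid=1000 and the 2-core / 8 GiB limits are illustrative only.
uid=1000
# In production this lives under /etc/systemd/system/; a relative
# path is used here so the sketch can run anywhere.
dropin_dir="user-${uid}.slice.d"
mkdir -p "$dropin_dir"
cat > "$dropin_dir/50-limits.conf" <<EOF
[Slice]
CPUQuota=200%
MemoryMax=8G
EOF
# On a live system, follow up with: systemctl daemon-reload
echo "wrote $dropin_dir/50-limits.conf"
```

If I recall correctly, on those newer systemds a single drop-in under /etc/systemd/system/user-.slice.d/ applies the same limits to every user at once, which sidesteps the one-file-per-user maintenance problem raised below.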

Arbiter looks really interesting but I haven't had a chance to play with it
yet. Need to bump that further up the priority list...

On Fri, Mar 26, 2021 at 8:27 AM Prentice Bisbal via Beowulf <
beowulf at beowulf.org> wrote:

> Yes, there's a tool developed specifically for this called Arbiter that
> uses Linux cgroups to dynamically limit resources on a login node based
> on its current load. It was developed at the University of Utah:
>
> https://dylngg.github.io/resources/arbiterTechPaper.pdf
>
> https://gitlab.chpc.utah.edu/arbiter2/arbiter2
>
> Prentice
>
> On 3/26/21 9:56 AM, Michael Di Domenico wrote:
> > does anyone have a recipe for limiting the damage people can do on
> > login nodes on rhel7.  i want to limit the allocatable cpu/mem per
> > user to some low value.  that way if someone kicks off a program but
> > forgets to 'srun' it first, they get bound to a single core and don't
> > bump anyone else.
> >
> > i've been poking around the net, but i can't find a solution, i don't
> > understand what's being recommended, and/or i'm implementing the
> > suggestions wrong.  i haven't been able to get them working.  the most
> > succinct answer i found is that per user cgroup controls have been
> > implemented in systemd v239/240, but since rhel7 is still on v219
> > that's not going to help.  i also found some wonkiness that runs a
> > program after a user logs in and hacks at the cgroup files directly,
> > but i couldn't get that to work.
> >
> > supposedly you can override the user-{UID}.slice unit file and jam in
> > the cgroup restrictions, but with hundreds of users that's clearly
> > not maintainable.
> >
> > i'm sure others have already been down this road.  any suggestions?
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf