[Beowulf] [External] head node abuse [EXT]

Peter Clapham pc7 at sanger.ac.uk
Tue May 25 11:06:39 UTC 2021


+1 to this.


Relying on tech alone to solve issues like this tends to fall foul of newly required software stacks and their demands on head nodes, e.g. Nextflow, Cromwell etc. So a combined people/process and education/assistance approach, together with some technical safety nets, seems appropriate, as long as there is agreement on what is proportionate.


Another $0.02 in the pot.


Pete

________________________________
From: Beowulf <beowulf-bounces at beowulf.org> on behalf of Adam DeConinck <ajdecon at ajdecon.org>
Sent: 26 March 2021 14:41:13
To: beowulf
Subject: Re: [Beowulf] [External] head node abuse [EXT]

I agree with Chris D that this is more of a human problem than a technical problem. I have actually had a lot of success with user education -- people don't often think about the implications of having lots of people logged into the same head node, but get the idea when you explain it. Especially when you explain it along the lines of, "if we let all these other people test their MPI jobs on the head node, it would slow down YOUR work!"

Granted, people don't tend to read that explanation in the onboarding doc, and I often have to re-explain it when it comes up in practice. ;-) But in general I rarely see "repeat offenders", and when it happens removing access is the right policy.

We do ALSO enforce some per-user limits with cgroups (auto-generating the user-{UID}.slice as part of the user onboarding process). But in practice this mostly protects against accidental abuse ("whoops, I launched mpirun in the wrong terminal!"). The rare people who intentionally misuse the head node will find work-arounds.
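A minimal sketch of what such an auto-generated per-UID drop-in might look like (the UID, path, and limit values here are illustrative, and MemoryLimit= is the cgroup-v1 name older systemd understands; newer versions spell it MemoryMax=):

    # /etc/systemd/system/user-1234.slice.d/50-limits.conf
    # Generated per UID at onboarding time; caps what any one user's
    # login sessions can take from the head node.
    [Slice]
    CPUAccounting=yes
    MemoryAccounting=yes
    CPUQuota=200%
    MemoryLimit=8G

If something like this is written out at account-creation time, a systemctl daemon-reload picks it up and the limits apply from the user's next session.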

Arbiter looks really interesting but I haven't had a chance to play with it yet. Need to bump that further up the priority list...

On Fri, Mar 26, 2021 at 8:27 AM Prentice Bisbal via Beowulf <beowulf at beowulf.org> wrote:
Yes, there's a tool developed specifically for this called Arbiter that
uses Linux cgroups to dynamically limit resources on a login node based
on its current load. It was developed at the University of Utah:

https://dylngg.github.io/resources/arbiterTechPaper.pdf

https://gitlab.chpc.utah.edu/arbiter2/arbiter2
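Not Arbiter's actual code, but the general shape of what it automates is a watch-and-clamp loop over per-user slices. A toy sketch, where the 30-second poll, the 90% threshold, and the one-core clamp are all illustrative (Arbiter itself adds badness scores, penalty tiers, and email notifications):

    #!/bin/bash
    # Toy illustration only, NOT Arbiter: every 30s, find non-system
    # users whose processes sum to more than 90% of a CPU, and clamp
    # their systemd slice to one core's worth of time.
    while sleep 30; do
        ps -eo uid,pcpu --no-headers |
        awk '{cpu[$1] += $2}
             END {for (u in cpu) if (u+0 >= 1000 && cpu[u] > 90) print u}' |
        while read -r uid; do
            systemctl set-property --runtime "user-$uid.slice" CPUQuota=100%
        done
    done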

Prentice

On 3/26/21 9:56 AM, Michael Di Domenico wrote:
> Does anyone have a recipe for limiting the damage people can do on
> login nodes on RHEL 7? I want to limit the allocatable CPU/memory per
> user to some low value, so that if someone kicks off a program but
> forgets to 'srun' it first, they get bound to a single core and don't
> bump anyone else.
>
> I've been poking around the net, but I can't find a solution, I don't
> understand what's being recommended, and/or I'm implementing the
> suggestions wrong; I haven't been able to get them working. The most
> succinct answer I found is that per-user cgroup controls were
> implemented in systemd v239/240, but since RHEL 7 is still on v219
> that's not going to help. I also found some wonkiness that runs a
> program after a user logs in and hacks at the cgroup files directly,
> but I couldn't get that to work.
>
> Supposedly you can override the user-{UID}.slice unit file and jam in
> the cgroup restrictions, but I have hundreds of users, so clearly
> that's not maintainable.
>
> I'm sure others have already been down this road. Any suggestions?
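For the archive: the "runs a program after a user logs in" approach can be as small as a pam_exec hook that sets properties on the live slice, which avoids maintaining per-UID unit files entirely. A sketch only, assuming pam_exec(8) and systemctl set-property as shipped in systemd 219; the limit values are illustrative:

    # Added to /etc/pam.d/sshd, *after* pam_systemd.so so the user's
    # slice already exists when the hook runs:
    #   session optional pam_exec.so /usr/local/sbin/limit-user-slice.sh

    #!/bin/bash
    # /usr/local/sbin/limit-user-slice.sh
    # pam_exec exports PAM_USER; clamp that user's slice at login.
    uid=$(id -u "$PAM_USER" 2>/dev/null) || exit 0
    [ "$uid" -lt 1000 ] && exit 0   # leave system accounts alone
    # --runtime: nothing on disk to keep in sync for hundreds of
    # users; re-applied at every login, lost at reboot.
    systemctl set-property --runtime "user-$uid.slice" \
        CPUQuota=100% MemoryLimit=4G
    exit 0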
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org, sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf





