On 25/07/13 14:40, Mark Hahn wrote:

> do you really find users who decide to choose their own nodes?

In the past, yes: they've either come from places that haven't had a
queuing system, or they haven't used HPC before and haven't read the
docs or been to the courses.

> limiting ssh access, done right, can permit (c) and prevent (a).

That's what we do.  Users can log in to nodes their jobs are on.  I'm
hoping that the Slurm PAM module's aim of moving users who SSH into a
node into the cgroup for their job will get implemented.  That way, if
they do log in and run stuff that impacts the node, they'll only hurt
their own jobs.

> we don't really see (a) enough to worry about it (we're pretty big 
> on at least basic user inculcation...)  and most of (b) I see is 
> actually not helped, since the rogue jobs are usually escapees, 
> rather than mis-aimed.

Yeah, we see rogue jobs and have health check scripts that can fix
them up for the simple cases (and alert us and take the node offline
for others).  That cuts down on the emails we get from users asking
why their jobs are running slower than usual.
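
As a rough illustration of the simple case, here is a minimal Python
sketch of the detection step only.  It is not our actual health check:
the names Proc and find_rogue_processes are made up for the example,
and it assumes that ordinary user accounts start at UID 1000 and that
the list of users with jobs on the node has already been obtained from
the batch system.

```python
from typing import Iterable, NamedTuple


class Proc(NamedTuple):
    """One process-table entry (hypothetical structure for this sketch)."""
    pid: int
    uid: int
    user: str
    command: str


def find_rogue_processes(procs: Iterable[Proc], job_users: set,
                         min_user_uid: int = 1000) -> list:
    """Return processes owned by ordinary users with no job on this node.

    Assumption: UIDs below min_user_uid are system accounts and are
    never treated as rogue.
    """
    return [p for p in procs
            if p.uid >= min_user_uid and p.user not in job_users]
```

A real script would fill in procs from the process table and job_users
from the scheduler, then decide whether to clean up or take the node
offline.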

> I suppose you could charge by utime+stime rather than real time.

That would mean a lot of extra hacking around.  We're using Gold
(with Torque and Moab) at the moment and will be moving to Slurm in
the very near future (as it's what we run on our BG/Q), so we bend to
those tools' whims on charging.
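
To make the difference between the two charging models concrete, here
is a minimal Python sketch of the arithmetic.  The function names are
hypothetical, and this is not how Gold, Moab or Slurm compute charges;
it only contrasts charging for reserved cores over real time with
charging for utime+stime.

```python
def wall_clock_charge(cores: int, elapsed_s: float) -> float:
    """Core-hours charged for reserving `cores` for `elapsed_s` seconds."""
    return cores * elapsed_s / 3600.0


def cpu_time_charge(utime_s: float, stime_s: float) -> float:
    """Core-hours charged for CPU time actually consumed (user + system)."""
    return (utime_s + stime_s) / 3600.0


# A job that reserves 16 cores for an hour but keeps only one core busy:
reserved = wall_clock_charge(cores=16, elapsed_s=3600)      # 16.0 core-hours
consumed = cpu_time_charge(utime_s=3500, stime_s=100)       # 1.0 core-hours
```

Under the second model a user who blocks a whole node while using one
core pays for one core, which is the trade-off to keep in mind.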

