[Beowulf] question about enforcement of scheduler use
Andrew D. Fant
fant at pobox.com
Mon May 22 18:05:02 PDT 2006
Larry,
I'll echo what Chris said about not seeking a technical solution for a
political problem. You can't solve the latter by yourself, either. Get the
users involved and the people who write their evaluations, and agree to
acceptable terms of use and some sort of penalty for violating them.
Having said that, there are various technical things that one can do to
limit the ability of a casually frustrated user to game the system. One of my
favorites involves putting a monitoring system in place that counts logins and
lets you know when the login count goes over a given number (in my case, I like
to set it to 1, so that I can get in to fix things in a shell if I have to,
though you want to minimize this in a well-run cluster). This relies on the
batch system not writing to wtmp and starting a login session for users, so it
might not work on PBSPro, but I like it.
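
As a rough illustration of the login-count check described above, here is a minimal Python sketch. It is not the monitoring system from the post. It assumes the check runs on each compute node, that the standard `who` command is available, and that batch jobs do not register login sessions. `who` reads utmp, the current-session counterpart of wtmp. The function names and the warning format are hypothetical.

```python
import subprocess

# Assumption: the threshold of 1 comes from the post; one admin shell is tolerated.
LOGIN_LIMIT = 1


def parse_who(who_output):
    """Return the user name of each login session listed in `who` output."""
    return [line.split()[0] for line in who_output.splitlines() if line.strip()]


def check_logins(who_output, limit=LOGIN_LIMIT):
    """Return (session count, sorted unique users, alert flag) for one node."""
    users = parse_who(who_output)
    return len(users), sorted(set(users)), len(users) > limit


def main():
    # `who` lists only sessions that registered a login record, so jobs
    # started by a batch system that skips utmp/wtmp are not counted.
    output = subprocess.run(
        ["who"], capture_output=True, text=True, check=True
    ).stdout
    count, users, alert = check_logins(output)
    if alert:
        print(f"WARNING: {count} login sessions (limit {LOGIN_LIMIT}): "
              f"{', '.join(users)}")
```

Calling `main()` from cron on each node, and mailing any output to the admin, would be one way to wire this up. The parsing is kept separate from the `who` call so it can be checked against canned output.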
One other possibility that you might consider if you are somewhat desperate
and you have a terminal server and serial port console on all the systems (or a
separate management host that users cannot access) is to put a firewall rule or
tcpwrappers rule in place that prevents ssh connections from the head node to
the compute nodes. Normally, I don't like firewalls on compute nodes, because
they add to the failure modes in glorious and obscure ways, but this might be one
way to buy time in an arms race to get management to understand a problem. You
probably will want to put a reverse rule in place as well, to keep users from
submitting jobs to start ssh or sshd on a compute node and start a tunnel back
to the head node that they can access. Again, if it reaches this point, you
probably already have a problem above your pay grade.
The last bit of advice that I will toss out is that you may want to
seriously look at LSF or GridEngine for your cluster. PBSPro does have
commercial support behind it, but from what I have seen, its salad days are
behind it. If you need commercial support and industrial strength integration,
LSF is the market leader at this point, and if you are in need of low cost and
current technologies, GridEngine is seeing consistent growth in user base and
development.
I see you are at Georgia State. If you want to talk to someone face to
face and have a real conversation about cluster management, email me off-list.
I know some people in Atlanta who might be willing to give some advice to
someone who has been thrown to the wulfs, as it were.
HTH,
Andy
--
Andrew Fant | And when the night is cloudy | This space to let
Molecular Geek | There is still a light |----------------------
fant at pobox.com | That shines on me | Disclaimer: I don't
Boston, MA | Shine until tomorrow, Let it be | even speak for myself