[Beowulf] first cluster

Prentice Bisbal prentice at ias.edu
Fri Jul 16 10:13:55 PDT 2010

> There is one question that perplexes me, to which I have not found an
> answer.
> How does the presence of a job scheduler interact with the ability of a user to
>   ssh to <head>, 
>   ssh to <compute-node-n>, and then type 
>   mpirun -np 64 my_application
> Intuition tells me there has to be something in a cluster setup, when
> it has a scheduler, that prevents a user from circumventing the
> scheduler by doing something like the above.

That is definitely a problem that must be dealt with. There's no point
in having a scheduler if everyone bypasses it. There are a few ways you
could do it. Here's how I do it:

My cluster mounts the same NFS file systems (/home directories,
/usr/local, etc.) as all the user workstations, and our more powerful
multi-user 64-bit servers with lots of RAM. We call the latter 'compute
servers' (just to avoid confusion on the list with compute nodes in the
cluster). The compute servers are outside the cluster network, but can
communicate with the head node.

I use SGE, which separates the roles of compute host, submission host,
and administration host (not sure if other resource managers behave the
same way). A member of an SGE cluster can be any combination of these 3
things. Our compute servers are set up as submit hosts, so users can
use them to compile their programs, submit jobs, and check on their
status without ever actually logging into any cluster node at all. The
SSH configuration on the head node prevents anyone other than the
administrative staff from logging in, and the rest of the cluster is on
a private network, so the only way to run a job on the cluster is by
submitting a batch job through SGE.
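
As a rough illustration of the head-node SSH restriction described
above, here is a minimal sketch of an sshd_config fragment. It assumes
OpenSSH and that the administrative staff all belong to a single Unix
group. The group name "hpcadmin" is made up for this example and is not
taken from the setup described here.

```
# /etc/ssh/sshd_config on the head node (sketch, assuming OpenSSH)
# "hpcadmin" is a hypothetical group holding the administrative staff.
# With AllowGroups set, only members of the listed groups may log in;
# everyone else is refused, whatever their other credentials.
AllowGroups hpcadmin
```

Checking the edited file with "sshd -t" before reloading the daemon is
a cheap way to avoid locking yourself out of the head node.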

The only drawback of this system is that users cannot request
interactive jobs on my cluster, but I don't see that as a very big
problem, since most cluster jobs are batch jobs anyway.

Not sure if you can do the same thing with other resource managers; SGE
is the only one I've used. I'm also not sure if other resource managers
still rely on rsh/ssh to start jobs. If they do, that adds some
complexity: the cluster nodes have to accept the rsh/ssh connections
that start jobs while still refusing interactive logins.

