I've got 8 linux boxes, what now

Fri Dec 7 15:25:15 PST 2001

On Fri, Dec 07, 2001 at 05:08:18PM -0500, Greg Lindahl wrote:
> On Fri, Dec 07, 2001 at 01:48:59PM -0800, Martin Siegert wrote:
> 
> > At that time I had to make a decision about load balancing and a batch
> > queueing system. I decided to have none of it.
> 
> First off, you proceed to discuss why you don't like batch queues, but
> don't talk at all about LVS and other techniques for load balancing.
> That just means the user types "ssh cluster.sfu.ca" and they end up
> logging into the node with the lowest load. That's very easy to use.

Ok. If that's what LVS means, then that's basically what we are doing.
(Not with that degree of automation: all users must log in to the master
node and submit their jobs from there. Given a script that outputs the
running jobs on the cluster, users are very good at figuring out which node
is best for them.)
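
For illustration, a script that lists the running jobs could look roughly
like the Python sketch below. This is not the script used on the cluster
described here: the node names, the 50% CPU threshold, and the function
names parse_ps and jobs_on are assumptions made for the example. It also
assumes password-less ssh to every node and a procps-style ps that accepts
"-eo user,ni,pcpu,comm".

```python
import subprocess


def parse_ps(text, min_cpu=50.0):
    """Return (user, nice, pcpu, command) tuples for CPU-bound processes.

    `text` is the output of `ps -eo user,ni,pcpu,comm`. The 50% threshold
    is an arbitrary assumption used to filter out idle daemons and shells.
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] == "USER":
            continue  # skip blank lines and the header line
        user, ni, pcpu, comm = fields
        try:
            nice, cpu = int(ni), float(pcpu)
        except ValueError:
            continue  # some processes report "-" instead of a nice value
        if cpu >= min_cpu:
            jobs.append((user, nice, cpu, comm.strip()))
    return jobs


def jobs_on(node):
    """Run ps on `node` via ssh and return its CPU-bound jobs."""
    result = subprocess.run(
        ["ssh", node, "ps", "-eo", "user,ni,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)


# Example use (hypothetical node names):
#   for node in ["node1", "node2"]:
#       for user, nice, cpu, comm in jobs_on(node):
#           print(f"{node:8} {user:10} nice={nice:3d} cpu={cpu:5.1f} {comm}")
```

Printing the nice value next to each job matters here, because which node
is "best" depends on whether the jobs already running there are niced.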

> As far as your criticisms of batch queues, you don't have to set them
> up that way. You can set it up so that all jobs run immediately. That
> provides a load balancing function, and a central way to figure out
> your job status. It doesn't provide ideal use of resources in the face
> of oversubscription, but it can't be gamed by the users.

That oversubscription is actually what I try to avoid.

> Alternately, you can provide a couple of scripts that do nothing but
> (1) start a command line on the node with the lowest load, and (2) run
> ps on all the nodes and grep for that user's username. Same
> difference.

What is the lowest load? A node with two jobs at nice 0 or a node with
five jobs at nice 19 (we have a rule that every user can start two jobs
at nice 0 and arbitrarily many jobs at nice 19)? Depending on your
set of rules/guidelines the answer to that question may vary.
The problem with automatic assignment of jobs to nodes is that it isn't
very flexible. E.g., I don't mind if a user starts as many jobs as (s)he
wants as long as there are idle processors. On the other hand, I ask
users who run a large number of jobs to stop jobs as soon as they
prevent other users from doing anything. By requesting this kind of
action from the users we were able to achieve better and fairer
utilization than with any batch system I have ever worked with.
Naturally, we are not a vendor who sells systems, thus YMMV.
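
To make the "what is the lowest load?" question concrete, here is a small
Python sketch that ranks the two example nodes under two different rules.
Both rules are assumptions made up for the illustration: the first counts
every job equally, the second treats a nice-19 job as 5% of a nice-0 job.
Neither is a policy that any particular batch system or load balancer is
known to use.

```python
def count_all(nice):
    """Rule 1 (assumed): every job counts as one unit of load."""
    return 1.0


def discount_niced(nice):
    """Rule 2 (assumed): weight falls linearly from 1.0 at nice 0
    to 0.05 at nice 19; values outside 0..19 are clamped."""
    nice = max(0, min(19, nice))
    return 1.0 - 0.95 * (nice / 19.0)


def node_load(nice_values, weight):
    """Sum the weights of all jobs on one node."""
    return sum(weight(n) for n in nice_values)


def least_loaded(nodes, weight):
    """Return the name of the node with the smallest weighted load."""
    return min(nodes, key=lambda name: node_load(nodes[name], weight))


# The two nodes from the question above (node names are hypothetical).
nodes = {
    "node1": [0, 0],     # two jobs at nice 0
    "node2": [19] * 5,   # five jobs at nice 19
}

print(least_loaded(nodes, count_all))       # picks node1 (2 jobs vs. 5)
print(least_loaded(nodes, discount_niced))  # picks node2 (about 0.25 vs. 2.0)
```

The same cluster state gives opposite answers under the two rules, which
is the point: an automatic assignment has to hard-code one of them.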

> The nice thing about the batch queue is that it also copes
> with an MPI cluster in addition to a big pile of interactive nodes.

Actually that is something I have never figured out: How do you do this?
Situation: every processor on the cluster has (at least) one job running
on it. Somebody submits an MPI job for 4 processors. Do you wait until
you have 4 idle processors? That requires leaving up to 3 processors
idle for an extended period of time. Or do you start the MPI job immediately
(or when one processor becomes idle)? In that case the timing of your
MPI program becomes unpredictable due to the different loads on the processors.
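
The cost of the first option can be put into numbers with a short Python
sketch. It assumes one job per processor, that the remaining run time of
each job is known (which in practice it rarely is), and that a processor
is held idle from the moment its job finishes until enough processors are
free. The function name drain_cost and the run times are made up for the
illustration.

```python
def drain_cost(remaining_hours, needed):
    """Return (start_time, idle_processor_hours) for a job that needs
    `needed` processors, when freed processors are held idle until
    `needed` of them are available.

    `remaining_hours` lists the remaining run time of the job currently
    occupying each processor.
    """
    soonest = sorted(remaining_hours)[:needed]
    if len(soonest) < needed:
        raise ValueError("cluster has fewer processors than requested")
    start = soonest[-1]  # the last of the `needed` processors to free up
    idle = sum(start - t for t in soonest)
    return start, idle


# Hypothetical remaining run times in hours, one job per processor.
remaining = [2, 30, 50, 100, 200, 300, 400, 500]
start, idle = drain_cost(remaining, needed=4)
print(f"MPI job starts after {start} h; {idle} processor-hours sit idle")
```

With these numbers three processors wait 98, 70 and 50 hours for the
fourth, so 218 processor-hours are lost before the MPI job even starts.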

> Or if you used Condor as your batch queue, you could add some desktop
> machines to the cluster, for additional oomph at night.

This wouldn't help in our case: we have an ample supply of machines that
can run jobs for a few hours. We need facilities for jobs that run for days,
weeks, months. Again: YMMV. I just tried to point out that batch systems
may not work very well in a university environment.

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================