I've got 8 linux boxes, what now

Fri Dec 7 15:49:28 PST 2001

On Fri, Dec 07, 2001 at 03:25:15PM -0800, Martin Siegert wrote:

> > First off, you proceed to discuss why you don't like batch queues, but
> > don't talk at all about LVS and other techniques for load balancing.
> > That just means the user types "ssh cluster.sfu.ca" and they end up
> > logging into the node with the lowest load. That's very easy to use.
> 
> Ok. If that's what LVS means, then that's basically what we are doing.

Yes, that's what LVS does. Another way of doing a similar thing is
round robin DNS, but that doesn't take care of down nodes or look at
the actual load.
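
A minimal sketch of that difference, assuming a hypothetical table of
node names, up/down state, and load averages (none of these names come
from LVS or from any DNS server; this is not LVS's actual scheduler):

```python
from itertools import cycle

# Hypothetical node table: name -> (is_up, one-minute load average).
NODES = {
    "node1": (True, 2.5),
    "node2": (False, 0.0),  # down, so its load of 0.0 means nothing
    "node3": (True, 0.3),
}

def round_robin(names):
    """Hand out names in fixed rotation, the way round robin DNS does."""
    return cycle(names)

def least_loaded(nodes):
    """Pick the live node with the lowest load, skipping down nodes."""
    live = {name: load for name, (up, load) in nodes.items() if up}
    if not live:
        raise RuntimeError("no live nodes")
    return min(live, key=live.get)

rr = round_robin(sorted(NODES))
rr_picks = [next(rr) for _ in range(4)]  # rotation still hands out node2
best = least_loaded(NODES)               # skips node2, picks node3
```

The rotation sends every third login to the dead node, while the
load-aware pick never does.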

> > As far as your criticisms of batch queues, you don't have to set them
> > up that way. You can set it up so that all jobs run immediately. That
> > provides a load balancing function, and a central way to figure out
> > your job status. It doesn't provide ideal use of resources in the face
> > of oversubscription, but it can't be gamed by the users.
> 
> That oversubscription is actually what I try to avoid.

Any system which attempts to prevent oversubscription can be
gamed. From your description of your environment, by the way, I had no
idea that you had anything to prevent oversubscription. It sounded
like the reverse, actually.

> > Alternately, you can provide a couple of scripts that do nothing but
> > (1) start a command line on the node with the lowest load, and (2) run
> > ps on all the nodes and grep for that user's username. Same
> > difference.
> 
> What is the lowest load?

That's an interesting question, but I thought we were discussing the
overall architecture here, not the details. The answer is, of course,
that what counts as "the load" is site-specific. If you want a
complicated policy, the batch system you choose should support
complicated policies. I assure you that there are some nicely
complicated schedulers out there, like Maui, schedulers that do "fair
share" scheduling, and so forth.
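
The two scripts mentioned above could be sketched roughly as follows.
Everything here is an assumption for illustration: the metric (load
average divided by CPU count) is just one possible site-specific
choice, the function names are made up, and the text is assumed to
have been collected already, for example by reading /proc/loadavg and
running ps on each node over ssh.

```python
def load_per_cpu(loadavg_text, cpus):
    """One possible site-specific metric: 1-minute load per CPU."""
    one_minute = float(loadavg_text.split()[0])
    return one_minute / cpus

def pick_node(reports):
    """reports maps node name -> (contents of /proc/loadavg, CPU count)."""
    return min(reports, key=lambda n: load_per_cpu(*reports[n]))

def jobs_for_user(ps_text, user):
    """Keep lines of `ps aux`-style output whose first column is user."""
    lines = ps_text.splitlines()[1:]  # drop the header line
    return [ln for ln in lines if ln.split() and ln.split()[0] == user]

# Hypothetical sample data.
reports = {
    "node1": ("1.80 1.50 1.20 2/90 4321", 2),  # dual CPU: 0.9 per CPU
    "node2": ("1.10 0.90 0.70 1/80 1234", 1),  # single CPU: 1.1 per CPU
}
ps_text = (
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
    "alice 101 99.0 1.0 1000 500 ? R 10:00 5:00 ./sim\n"
    "bob 102 0.0 0.1 100 50 ? S 10:01 0:00 bash\n"
)

chosen = pick_node(reports)
alice_jobs = jobs_for_user(ps_text, "alice")
```

With raw load averages node2 would look less busy; dividing by CPU
count picks node1 instead, which is the sense in which the number is a
policy decision.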

> > The nice thing about the batch queue is that it also copes
> > with an MPI cluster in addition to a big pile of interactive nodes.
> 
> Actually that is something I have never figured out: How do you do this?

One way to do it is to have two kinds of jobs: serial jobs, and MPI
jobs, with completely different policies and resources.

Another way is to use something like Condor, which can checkpoint
serial jobs to get them out of the way of MPI jobs, or another batch
queue system that supports backfill (running short jobs in the gaps
while a large job waits for its nodes). Of course, the decisions
themselves are policy items, and you'll have to invent a policy.
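
As a rough illustration of the backfill idea, here is a toy sketch. It
is not the algorithm of Maui, Condor, or any other real scheduler; the
function name and arguments are hypothetical, and it assumes users
supply a runtime estimate for each serial job.

```python
def backfill(idle_nodes, now, reservation_start, serial_jobs):
    """Start serial jobs on idle nodes only if they will finish before
    the waiting MPI job's reserved start time.

    serial_jobs is a list of (name, requested_hours) in queue order.
    Returns the names of the jobs started, one per idle node.
    """
    window = reservation_start - now
    started = []
    for name, hours in serial_jobs:
        if len(started) == idle_nodes:
            break  # no idle nodes left
        if hours <= window:
            started.append(name)
    return started

# Two nodes sit idle for 6 hours until the MPI job can start.
queue = [("a", 8), ("b", 4), ("c", 2), ("d", 1)]
started = backfill(idle_nodes=2, now=0, reservation_start=6,
                   serial_jobs=queue)
```

Job "a" is skipped because it would overrun the MPI job's start, so
"b" and "c" fill the gap. Whether skipping ahead of "a" is acceptable
is exactly the kind of policy decision a site has to make.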

> > Or if you used Condor as your batch queue, you could add some desktop
> > machines to the cluster, for additional oomph at night.
> 
> This wouldn't help in our case: we have ample supply of machines that
> can run jobs for a few hours. We need facilities for jobs that run days,
> weeks, months.

Condor's checkpointing allows it to take advantage of machines
available for just a few hours, even if the jobs end up running for
days, weeks, or months.
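
A back-of-the-envelope sketch of why that works, with made-up numbers
(the job length, window length, and checkpoint/restart overhead are
all assumptions, not Condor measurements):

```python
import math

def windows_needed(job_hours, window_hours, overhead_hours=0.0,
                   checkpointing=True):
    """Number of availability windows a job needs, or None if it can
    never finish under the given conditions."""
    if not checkpointing:
        # Without checkpoints the job restarts from zero every window.
        return 1 if job_hours <= window_hours else None
    useful = window_hours - overhead_hours
    if useful <= 0:
        return None
    return math.ceil(job_hours / useful)

# A 10-day (240 hour) job on machines free for 8 hours at a time,
# losing half an hour per window to checkpoint and restart.
with_ckpt = windows_needed(240, 8, overhead_hours=0.5)
without_ckpt = windows_needed(240, 8, checkpointing=False)
```

Under these assumptions the job finishes after 32 windows with
checkpointing, and never finishes without it.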

> Again: YMMV. I just tried to point out that batch systems
> may not work very well in a university environment.

Sure, I think everyone agrees that batch systems are not a panacea.
However, there are a lot of different systems, and a lot of different
ways to set them up.

greg