[Beowulf] Please help to setup Beowulf

Thu Feb 19 01:33:00 PST 2009

On Wednesday 18 February 2009 16:30:37 Mark Hahn wrote:
> if SGE has to keep re-reading user files, this suggests to me that its
> design is poor.  it's obvious to me that a scheduler should be based on
> a production-quality DB, for instance, and should clearly give some
> thought to performance when there are many runnable jobs.

IIRC, LSF, which to me qualifies as a production-quality scheduler, stores its 
job status in text files, parses them every time it needs them, and also reads 
each job definition right before starting the job, because it has no way of 
knowing whether the resource requirements are the same as the previous job's, 
unless the jobs are part of an array. 
(This was for versions 6.x, and may have changed since.)

> is it?  an array job, in my experience at least, is just syntactic sugar
> for submission.  

And also for managing later on. It's much easier to issue a 

	bjobs job_ID[low_index-high_index:step] 

and get details about the parts of your job you especially care about, in the 
logical order that is relevant to your problem, rather than having to store an 
arbitrary list of random job numbers, which only make sense to the scheduler 
internals.
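
To make the low_index-high_index:step notation concrete, here is a minimal 
Python sketch of which array elements such an expression selects. This is not 
LSF code: the function name expand_indices is made up for illustration, and it 
assumes the high index is inclusive and that the step defaults to 1.

```python
def expand_indices(spec):
    """Expand 'low-high:step' into the list of selected indices.

    Illustrative only: assumes an inclusive upper bound and a default
    step of 1. A bare index such as '5' selects just that element.
    """
    range_part, _, step_part = spec.partition(":")
    low_text, dash, high_text = range_part.partition("-")
    low = int(low_text)
    high = int(high_text) if dash else low
    step = int(step_part) if step_part else 1
    return list(range(low, high + 1, step))


print(expand_indices("1-10:3"))
```

The point is that the indices stay meaningful to the user (every third input 
file, say), whatever numbers the scheduler assigns internally.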

> all the internals of job scheduling still have to operate,

Yes, but they only affect the job_ID, which is unique to your array, rather 
than each individual job number. You can define the job array's indexes however 
you want when you submit it.

> though as you point out, the scheduler may take advantage of the fact that
> each sub-job has identical resource requirements.  it still needs to
> dispatch each sub-job separately, each SJ may fail in unique ways, etc.  

True.

> in
> the end, the submission and resource-matching code of the scheduler has
> gotten a break, but everything else works just as hard.  consider, for
> instance, that flat naming schemes are conceptually simpler, but array jobs
> break that. 

I'm not sure what you mean by that. For LSF at least, jobs can be given a 
job_name, including job arrays.

> remember that the user still has to write some sort of script
> to evaluate whether each SJ worked, and recover from the failures.

Sure, job arrays won't make errors go away. But they will be easier to track, 
because it's easier to see that you need to resubmit the first 3 jobs of your 
array, rather than job 542323, job 542326 and job 543623. You just have one 
job_ID to track.
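
As a rough sketch of why that bookkeeping is simpler: once you know which 
array indices failed, they collapse into a short range expression. The helper 
below is hypothetical (compress_indices is not an LSF tool) and simply assumes 
you already have the failed indices as a list of integers.

```python
def compress_indices(indices):
    """Collapse indices into a compact 'a-b,c' style string.

    Illustrative helper: consecutive indices are merged into ranges,
    duplicates are ignored, and the result is sorted.
    """
    runs = []
    for index in sorted(set(indices)):
        if runs and index == runs[-1][1] + 1:
            runs[-1][1] = index  # extend the current run
        else:
            runs.append([index, index])  # start a new run
    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in runs
    )


print(compress_indices([3, 1, 2]))
```

With individually submitted jobs there is no equivalent shorthand, since the 
job numbers carry no relation to the problem.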

You can get a quick summary of the status of the jobs in an array without 
needing to feed the scheduler a list of arbitrary job numbers. 
You can also submit jobs with dependency conditions on the status of other 
jobs, like "run job[2] only if job[1] succeeded". This is much harder to do 
when you submit jobs individually, because you can't know job numbers in 
advance. 
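
The "run job[2] only if job[1] succeeded" rule can be modelled in a few lines 
of Python. This is only a toy model of the idea, not how LSF implements it: 
the function name runnable_elements is invented, each element is assumed to 
depend on the one just before it, and the status strings merely mirror the 
PEND/DONE/EXIT names that bjobs reports.

```python
def runnable_elements(status):
    """Return pending indices whose predecessor (index - 1) is DONE.

    Toy model: 'status' maps array index -> status string. An element
    with no predecessor in the mapping is runnable as soon as it is
    pending.
    """
    ready = []
    for index in sorted(status):
        if status[index] != "PEND":
            continue
        previous = status.get(index - 1)
        if previous is None or previous == "DONE":
            ready.append(index)
    return ready


print(runnable_elements({1: "DONE", 2: "PEND", 3: "PEND"}))
```

Because the indices are chosen at submission time, a rule like this can be 
written before any job exists, which is exactly what is awkward with 
scheduler-assigned job numbers.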

Well, all this to say that array jobs can prove useful. :)

Cheers,
-- 
Kilian