[Beowulf] Anyone have docs/URLs/resources covering Grid Engine vs SLURM deltas or migration?

Fri Jan 29 20:12:57 UTC 2021

Hi folks,

Those who know me from my day job and other email address know that I've 
been a hardcore SGE person for a decade+ now.  Feels bad to even write 
this email, heh.

At my company we've been having a great internal discussion 
about SGE and SLURM, the specific trigger being our widespread use of 
AWS ParallelCluster for auto-scaling compute farms on AWS and the 
decision by AWS to deprecate the open source SGE distributed with the 
stack at some point in the future.

Related to this is the longstanding use of SGE in the life sciences -- 
there are genome sequencers and other wet lab instruments that ship 
natively with SGE support from the vendor so there is going to be a long 
tail of SGE still in use or targeted for use in biotech and pharma spaces.

I think there is still a future for commercial SGE, especially after the 
Univa -> Altair tie-up, but that leaves the poor orphaned/forked 
open source SGE distros hanging out there with no real 
updates or improvements in ages, so I understand the AWS HPC 
folks' desire to pare down their supported scheduler stack.

I want to build up my own knowledge and prep for our own increased use 
of SLURM on AWS.  I'm very comfortable with SGE architecture, 
operational philosophy and capabilities but I lack similar info for 
modern SLURM.  I'm ready and willing to start from scratch to build my 
own transition and "differences between SGE/SLURM" documentation, but I 
was wondering who out there has made this transition before and if there are 
any public domain FAQs, wikis, technical writeups or other guidance that 
I can learn from.

If I can manage to put my own materials together and they look sensible 
I will plan on publishing them openly. Thanks!

So far our internal conversation is centering on the 
differences in resource-based job scheduling when jobs need to 
declare required resources up front, such as GPUs or memory.  
Most of the differences beyond basic queue/partition 
design seem to center around the minutiae of scheduling and placing jobs, 
but I'm sure I'm missing other larger areas.
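
To make the resource-request side of that concrete, here is a rough sketch 
of how I currently think a few common up-front requests line up between 
the two schedulers. Treat it as a starting point, not a reference. The 
helper name sge_to_sbatch is made up for illustration, and the SGE side 
assumes a parallel environment called "smp" and a consumable complex 
called "gpu", both of which are site-defined and may be named differently 
(or not exist) on your cluster. The semantics are also not identical: 
h_vmem is a virtual memory limit while --mem-per-cpu requests real memory, 
and an SGE queue does not map one-to-one onto a SLURM partition.

```python
# Rough sketch: map a handful of common SGE qsub resource requests to
# their closest sbatch options. Assumptions (site-specific, NOT universal):
#   - an SGE parallel environment named "smp" exists
#   - an SGE consumable complex named "gpu" exists
#   - the SLURM partition has the same name as the SGE queue

def sge_to_sbatch(queue=None, slots=1, h_vmem=None, h_rt=None, gpus=0):
    """Return (qsub_args, sbatch_args) describing the same request."""
    qsub, sbatch = [], []
    if queue:
        qsub += ["-q", queue]                    # SGE queue
        sbatch += [f"--partition={queue}"]       # closest SLURM concept
    if slots > 1:
        qsub += ["-pe", "smp", str(slots)]       # slots via a parallel environment
        sbatch += [f"--cpus-per-task={slots}"]   # cores for a single task
    if h_vmem:
        qsub += ["-l", f"h_vmem={h_vmem}"]       # usually counted per slot
        sbatch += [f"--mem-per-cpu={h_vmem}"]    # per allocated CPU
    if h_rt:
        qsub += ["-l", f"h_rt={h_rt}"]           # hard wall-clock limit
        sbatch += [f"--time={h_rt}"]             # job time limit
    if gpus:
        qsub += ["-l", f"gpu={gpus}"]            # site-defined complex (assumed)
        sbatch += [f"--gres=gpu:{gpus}"]         # generic resource request
    return qsub, sbatch


if __name__ == "__main__":
    qsub, sbatch = sge_to_sbatch(queue="gpu.q", slots=4, h_vmem="8G",
                                 h_rt="02:00:00", gpus=1)
    print("qsub " + " ".join(qsub))
    print("sbatch " + " ".join(sbatch))
```

Both argument lists come back together so the two command lines can be 
eyeballed side by side, which is mostly what I want while learning where 
the two schedulers actually differ.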

Regards
Chris