[Beowulf] Most common cluster management software, job schedulers, etc?
Jeff Friedman
jeff.friedman at siliconmechanics.com
Wed Mar 9 16:34:06 PST 2016
Thanks everyone! Your replies were very helpful.
>
>
>> On Mar 8, 2016, at 2:49 PM, Christopher Samuel <samuel at unimelb.edu.au> wrote:
>>
>> On 08/03/16 15:43, Jeff Friedman wrote:
>>
>>> Hello all. I am just entering the HPC Sales Engineering role, and would
>>> like to focus my learning on the most relevant stuff. I have searched
>>> near and far for a current survey of some sort listing the top used
>>> “stacks”, but cannot seem to find one that is free. I was breaking
>>> things down similar to this:
>>
>> All the following is just what we use here, but in your role I would have
>> thought you'll need to be familiar with most of the options, depending on
>> customer requirements. Specialising in your preferred suite is down to
>> you, of course!
>>
>>> _OS distro_: CentOS, Debian, TOSS, etc? I know some come trimmed down,
>>> and some ship lightweight compute-node kernels, like CNL, CNK, or INK?
>>
>> RHEL - hardware vendors' support attitude tends to be "we support both
>> types of Linux, RHEL and SLES".
>>
>>> _MPI options_: MPICH2, MVAPICH2, Open MPI, Intel MPI, ?
>>
>> Open MPI
>>
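For reference, a minimal MPI "hello world" in C - the same source should build
unchanged against Open MPI, MPICH/MVAPICH2 or Intel MPI, since they all
implement the standard MPI API. The mpicc/mpirun wrapper names below are the
common ones; exact wrapper and module names vary by site.

    /* hello_mpi.c - minimal MPI example */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
        MPI_Get_processor_name(name, &namelen);  /* host this rank landed on */

        printf("Hello from rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();                          /* shut the runtime down */
        return 0;
    }

Compile and run (normally launched under the batch system rather than by hand):

    mpicc hello_mpi.c -o hello_mpi
    mpirun -np 4 ./hello_mpi
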
>>> _Provisioning software_: Cobbler, Warewulf, xCAT, Openstack, Platform HPC, ?
>>
>> xCAT
>>
>>> _Configuration management_: Warewulf, Puppet, Chef, Ansible, ?
>>
>> xCAT
>>
>> We use Puppet for infrastructure VMs (running Debian).
>>
>>> _Resource and job schedulers_: I think these are basically the same
>>> thing? Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid Engine,
>>> Univa, Platform LSF, etc… others?
>>
>> Yes and no - we run Slurm and use its own scheduling mechanisms but you
>> could plug in Moab should you wish.
>>
>> Torque has an example pbs_sched but that's just a FIFO, you'd want to
>> look at Maui or Moab for more sophisticated scheduling.
>>
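To make the FIFO-versus-priority distinction above concrete, here is a toy
sketch in C - not actual pbs_sched, Maui or Moab code, just an illustration
that a FIFO scheduler considers jobs strictly in submission order, while a
priority-based scheduler first orders the queue by a computed priority
(fairshare, queue time, job size, QOS and so on) and, in the real products,
also backfills small jobs around large ones that cannot start yet.

    /* scheduler_order.c - toy illustration of FIFO vs priority ordering */
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy job record - the fields are purely illustrative. */
    struct job { int id; long submit_time; int nodes_needed; double priority; };

    /* FIFO ordering (pbs_sched style): oldest submission first. */
    static int by_submit_time(const void *a, const void *b)
    {
        const struct job *x = a, *y = b;
        return (x->submit_time > y->submit_time) - (x->submit_time < y->submit_time);
    }

    /* Priority ordering (Maui/Moab style): highest computed priority first. */
    static int by_priority(const void *a, const void *b)
    {
        const struct job *x = a, *y = b;
        return (y->priority > x->priority) - (y->priority < x->priority);
    }

    int main(void)
    {
        struct job queue[] = {
            { 1, 100, 64, 0.2 },  /* big job, submitted first, low priority    */
            { 2, 200,  4, 0.9 },  /* small job, submitted later, high priority */
            { 3, 300,  8, 0.5 },
        };
        int n = sizeof queue / sizeof queue[0];

        qsort(queue, n, sizeof queue[0], by_submit_time);
        printf("FIFO order:     ");
        for (int i = 0; i < n; i++) printf("job %d  ", queue[i].id);
        printf("\n");

        qsort(queue, n, sizeof queue[0], by_priority);
        printf("Priority order: ");
        for (int i = 0; i < n; i++) printf("job %d  ", queue[i].id);
        printf("\n");

        return 0;
    }

Running it prints the two consideration orders (1, 2, 3 versus 2, 3, 1), which
is the whole point: the queue discipline, not the submission time, decides who
runs next.
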
>>> _Shared filesystems_: NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?
>>
>> GPFS here - copes well with lots of small files (looks at one OpenFOAM
>> project that has over 19 million files & directories - mostly
>> directories - and sighs).
>>
>>> _Library management_: Lmod, ?
>>
>> I've been using environment modules for almost a decade now but our
>> recent cluster has switched to Lmod.
>>
>>> _Performance monitoring_: Ganglia, Nagios, ?
>>
>> We use Icinga for monitoring infrastructure, including polling xCAT and
>> Slurm for node information such as error LEDs, down nodes, etc.
>>
>> We have pnp4nagios integrated with our Icinga to record time series
>> information about memory usage, etc.
>>
>>> _Cluster management toolkits_: I believe these perform many of the
>>> functions above, all wrapped up in one tool? Rocks, Oscar, Scyld, Bright, ?
>>
>> N/A here.
>>
>> All the best!
>> Chris
>> --
>> Christopher Samuel Senior Systems Administrator
>> VLSCI - Victorian Life Sciences Computation Initiative
>> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>> http://www.vlsci.org.au/ http://twitter.com/vlsci