[Beowulf] Bright Cluster Manager
Chris Dagdigian
dag at sonsorol.org
Wed May 2 13:19:48 PDT 2018
Jeff White wrote:
>
> I never used Bright. Touched it and talked to a salesperson at a
> conference but I wasn't impressed.
>
> Unpopular opinion: I don't see a point in using "cluster managers"
> unless you have a very tiny cluster and zero Linux experience. These
> are just Linux boxes with a couple applications (e.g. Slurm) running
> on them. Nothing special. xcat/Warewulf/Scyld/Rocks just get in the
> way more than they help IMO. They are mostly crappy wrappers around
> free software (e.g. ISC's dhcpd) anyway. When they aren't it's
> proprietary trash.
>
> I install CentOS nodes and use
> Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and
> software. This also means I'm not stuck with "node images" and can
> instead build everything as plain old text files (read: write
> SaltStack states), update them at will, and push changes any time. My
> "base image" is CentOS and I need no "baby's first cluster" HPC
> software to install/PXEboot it. YMMV
>
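(An aside for anyone who hasn't seen the workflow Jeff describes: a
SaltStack state is just a YAML text file. A minimal sketch of a compute
node state might look like the below -- the package, service and path
names are illustrative guesses, not taken from Jeff's setup, and will
vary by distro and site:

    # slurm-node.sls -- hypothetical example state for a compute node
    slurm_packages:
      pkg.installed:
        - pkgs:
          - slurm
          - munge

    # push the cluster-wide scheduler config as a plain text file
    /etc/slurm/slurm.conf:
      file.managed:
        - source: salt://hpc/slurm.conf
        - user: root
        - mode: '0644'

    # keep slurmd running; restart it whenever the config changes
    slurmd:
      service.running:
        - enable: True
        - watch:
          - file: /etc/slurm/slurm.conf

Edit the file, run "salt '*' state.apply slurm-node", and every node
converges on the change -- no golden image to rebuild.)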
Totally legit opinion and probably not unpopular at all given the user
mix on this list!
The issue here is that it assumes a level of domain expertise with
Linux, bare-metal provisioning, DevOps and (most importantly)
HPC-specific configuration work that may be pervasive or easily
available in your environment but is often not in a
commercial/industrial environment where HPC or "scientific computing"
is just another business area that a large central IT organization must
support.
If you have that level of expertise available, then the self-managed DIY
method is best. It's also my preference.
But in the commercial world, where HPC is becoming more and more
important, you run into stuff like:
- Central IT may not actually have anyone on staff who knows Linux (more
common than you'd expect; I see this in Pharma/Biotech all the time)
- The HPC user base is not given the budget or resources to self-support
their own stack because of a drive to centralize IT ops and support
- And if they do have Linux people on staff, they may be novice-level or
have zero experience with HPC schedulers, MPI fabric tweaking and app
needs (the domain stuff)
- And if miracles occur and they do have expert-level Linux people, then
more often than not those people are overworked or stretched in too
many directions
So what happens in these environments is that organizations will
willingly (and happily) pay commercial pricing and adopt closed-source
products if those products deliver a measurable reduction in
administrative, operational or support burden.
This is where Bright, Univa etc. all come in -- you can buy stuff from
them that dramatically reduces the care and feeding that onsite/local
IT has to provide.
Just having a vendor to call for support on Grid Engine oddities makes
the cost of Univa licensing worthwhile. Just having a vendor like Bright
be on the hook for "cluster operations" is a huge win for an overworked
IT staff that does not have Linux or HPC specialists on staff or easily
available.
My best example of "paying to reduce operational burden in HPC" comes
from a massive, well-known genome shop in the Cambridge, MA area. They
often tell this story:
- 300 TB of new data generation per week (many years ago)
- One of the initial storage tiers was ZFS running on commodity server
hardware
- Keeping the DIY ZFS appliances online and running took the FULL-TIME
efforts of FIVE STORAGE ENGINEERS
They realized that staff support was not scalable with DIY/ZFS at
300 TB/week of new data generation, so they went out and bought a giant
EMC Isilon scale-out NAS platform.
And you know what? After the Isilon NAS was deployed, the management of
*many* petabytes of single-namespace storage was handled by the IT
Director in his 'spare time' -- and the five engineers who used to do
nothing but keep ZFS from falling over were reassigned to more
impactful and presumably more fun/interesting work.
They actually went on stage at several conferences and told the story of
how Isilon allowed senior IT leadership to manage petabyte volumes of
data "in their spare time" -- this was a huge deal and really resonated.
It really reinforced for me how in some cases it's actually a good idea
to pay $$$ for commercial stuff if it delivers gains in
ops/support/management.
Sorry to digress! This is a topic near and dear to me. I often have to
do HPC work in commercial environments where the skills simply don't
exist onsite. Or more commonly -- they have budget to buy software or
hardware but are under a hiring freeze and not allowed to bring in new
humans.
Quite a bit of my work on projects like this is helping people make
sober decisions regarding "build" or "buy" -- and in those environments
it's totally clear that for some things it makes sense to pay for an
expensive, commercially supported "thing" that they don't have to
manage or support themselves.
My $.02 ...