[Beowulf] Bright Cluster Manager
Chris Dagdigian
dag at sonsorol.org
Wed May 2 13:19:48 PDT 2018
Jeff White wrote:
>
> I never used Bright. Touched it and talked to a salesperson at a
> conference but I wasn't impressed.
>
> Unpopular opinion: I don't see a point in using "cluster managers"
> unless you have a very tiny cluster and zero Linux experience. These
> are just Linux boxes with a couple applications (e.g. Slurm) running
> on them. Nothing special. xcat/Warewulf/Scyld/Rocks just get in the
> way more than they help IMO. They are mostly crappy wrappers around
> free software (e.g. ISC's dhcpd) anyway. When they aren't it's
> proprietary trash.
>
> I install CentOS nodes and use
> Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and
> software. This also means I'm not stuck with "node images" and can
> instead build everything as plain old text files (read: write
> SaltStack states), update them at will, and push changes any time. My
> "base image" is CentOS and I need no "baby's first cluster" HPC
> software to install/PXEboot it. YMMV
>
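(An aside for anyone who hasn't seen the workflow Jeff describes: a
SaltStack state is just a YAML text file. A minimal sketch of a compute
node state might look like the below -- the package, service and path
names are illustrative guesses, not taken from Jeff's setup, and will
vary by distro and site:

    # slurm-node.sls -- hypothetical example state for a compute node
    slurm_packages:
      pkg.installed:
        - pkgs:
          - slurm
          - munge

    # push the cluster-wide scheduler config as a plain text file
    /etc/slurm/slurm.conf:
      file.managed:
        - source: salt://hpc/slurm.conf
        - user: root
        - mode: '0644'

    # keep slurmd running; restart it whenever the config changes
    slurmd:
      service.running:
        - enable: True
        - watch:
          - file: /etc/slurm/slurm.conf

Edit the file, run "salt '*' state.apply slurm-node", and every node
converges on the change -- no golden image to rebuild.)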
Totally legit opinion and probably not unpopular at all given the user
mix on this list!
The issue here is that it assumes a level of domain expertise with
Linux, bare-metal provisioning, DevOps and (most importantly)
HPC-specific configuration work that may be pervasive or easily
available in your environment but is often not in a
commercial/industrial environment where HPC or "scientific computing"
is just another business area that a large central IT organization must
support.
If you have that level of expertise available, then the self-managed DIY
method is best. It's also my preference.
But in the commercial world, where HPC is becoming more and more
important, you run into stuff like:
- Central IT may not actually have anyone on staff who knows Linux (more
common than you'd expect; I see this in Pharma/Biotech all the time)
- The HPC user base is not given the budget or resources to self-support
their own stack because of a drive to centralize IT ops and support
- And if they do have Linux people on staff, they may be novice-level or
have zero experience with HPC schedulers, MPI fabric tweaking and app
needs (the domain stuff)
- And if miracles occur and they do have expert-level Linux people, then
more often than not those people are overworked or stretched in too
many directions
So what happens in these environments is that organizations will
willingly (and happily) pay commercial pricing and adopt closed-source
products if those products deliver a measurable reduction in
administrative, operational or support burden.
This is where Bright, Univa etc. all come in -- you can buy stuff from
them that dramatically reduces the care and feeding that onsite/local
IT has to provide.
Just having a vendor to call for support on Grid Engine oddities makes
the cost of Univa licensing worthwhile. Just having a vendor like Bright
be on the hook for "cluster operations" is a huge win for an overworked
IT staff that does not have Linux or HPC specialists on staff or easily
available.
My best example of "paying to reduce operational burden in HPC" comes
from a massive, well-known genome shop in the Cambridge, MA area. They
often tell this story:
- 300 TB of new data generation per week (many years ago)
- One of the initial storage tiers was ZFS running on commodity server
hardware
- Keeping the DIY ZFS appliances online and running took the FULL-TIME
efforts of FIVE STORAGE ENGINEERS
They realized that staff support was not scalable with DIY/ZFS at
300 TB/week of new data generation, so they went out and bought a giant
EMC Isilon scale-out NAS platform.
And you know what? After the Isilon NAS was deployed, the management of
*many* petabytes of single-namespace storage was handled by the IT
Director in his 'spare time' -- and the five engineers who used to do
nothing but keep ZFS from falling over were reassigned to more
impactful and presumably more fun/interesting work.
They actually went on stage at several conferences and told the story of
how Isilon allowed senior IT leadership to manage petabyte volumes of
data "in their spare time" -- this was a huge deal and really resonated.
It really reinforced for me how in some cases it's actually a good idea
to pay $$$ for commercial stuff if it delivers gains in
ops/support/management.
Sorry to digress! This is a topic near and dear to me. I often have to
do HPC work in commercial environments where the skills simply don't
exist onsite. Or more commonly -- they have budget to buy software or
hardware but are under a hiring freeze and not allowed to bring in new
humans.
Quite a bit of my work on projects like this is helping people make
sober decisions regarding "build" or "buy" -- and in those environments
it's totally clear that for some things it makes sense to pay for an
expensive, commercially supported "thing" that they don't have to
manage or support themselves.
My $.02 ...