[Beowulf] Bright Cluster Manager

Jörg Saßmannshausen sassy-work at sassy.formativ.net
Wed May 2 14:04:45 PDT 2018


Dear Chris,

further to your email:

> - And if miracles occur and they do have expert level linux people then
> more often than not these people are overworked or stretched in many
> directions

This is exactly what happened to me at my old workplace: I was pulled in too
many different directions.

I am a bit surprised about the ZFS experiences. Although I did not have
petabytes of storage and I did not generate 300 TB per week, I did run a
fairly large storage space on XFS and ext4 for backups and provisioning of
file space. Some of it was running on old hardware (please sit down, I am
talking about me messing around with SCSI cables) and I gradually upgraded
it to newer kit. So I am not quite sure what went wrong with the ZFS storage
here.

However, there is a common trend, at least from what I observe here in the UK,
to outsource problems: pass the buck to somebody else and pay them for it.
Personally, I am still more in favour of an in-house expert than an outsourced
person who may or may not understand what you are doing.
I should add that I work in academia and know little about the commercial
world here. Having said that, my friends in commerce tell me that their
companies like to outsource because it is 'cheaper'.
I agree about the Linux expertise. I think I am one of only two Linux admins
at my present workplace. The official line is: we do not support Linux
(but we teach it).

Anyhow, I don't want to digress here too much. However, regarding "...do HPC
work in commercial environments where the skills simply don't exist onsite":
are we a dying art?

My 1 shilling here from a still cold and dark London.

Jörg



On Wednesday, 2 May 2018, 16:19:48 BST, Chris Dagdigian wrote:
> Jeff White wrote:
> > I never used Bright.  Touched it and talked to a salesperson at a
> > conference but I wasn't impressed.
> > 
> > Unpopular opinion: I don't see a point in using "cluster managers"
> > unless you have a very tiny cluster and zero Linux experience.  These
> > are just Linux boxes with a couple applications (e.g. Slurm) running
> > on them.  Nothing special. xcat/Warewulf/Scyld/Rocks just get in the
> > way more than they help IMO.  They are mostly crappy wrappers around
> > free software (e.g. ISC's dhcpd) anyway.  When they aren't it's
> > proprietary trash.
> > 
> > I install CentOS nodes and use
> > Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and
> software.  This also means I'm not stuck with "node images" and can
> > instead build everything as plain old text files (read: write
> > SaltStack states), update them at will, and push changes any time.  My
> > "base image" is CentOS and I need no "baby's first cluster" HPC
> > software to install/PXEboot it.  YMMV
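
[To illustrate, for readers less familiar with SaltStack, what "plain old
text files" means here: a compute node can be described by a small state
file along the lines of the sketch below. This is only my own rough sketch,
not Jeff's actual setup; the package names, the slurm.conf source path and
the service name are assumptions that will differ from site to site.]

    # /srv/salt/hpc/compute.sls -- minimal Salt state for a CentOS compute
    # node (illustrative only; adjust packages and paths for your site)
    slurm-packages:
      pkg.installed:
        - pkgs:
          - slurm
          - slurm-slurmd

    /etc/slurm/slurm.conf:
      file.managed:
        - source: salt://hpc/files/slurm.conf   # plain text file, kept in version control
        - user: root
        - group: root
        - mode: '0644'

    slurmd:
      service.running:
        - enable: True
        - watch:
          - file: /etc/slurm/slurm.conf         # restart slurmd when the config changes

[Applying it with something like 'salt "node*" state.apply hpc.compute'
replaces the whole "node image" business with text files that can be
diffed, versioned and rolled back.]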
> 
> Totally legit opinion and probably not unpopular at all given the user
> mix on this list!
> 
> The issue here is assuming a level of domain expertise with Linux,
> bare-metal provisioning, DevOps and (most importantly) HPC-specific
> config stuff that may be pervasive or easily available in your
> environment but is often not easily available in a
> commercial/industrial environment where HPC or "scientific computing"
> is just another business area that a large central IT organization must
> support.
> 
> If you have that level of expertise available then the self-managed DIY
> method is best. It's also my preference.
> 
> But in the commercial world where HPC is becoming more and more
> important you run into stuff like:
> 
> - Central IT may not actually have anyone on staff who knows Linux (more
> common than you expect; I see this in Pharma/Biotech all the time)
> 
> - The HPC user base is not given budget or resource to self-support
> their own stack because of a drive to centralize IT ops and support
> 
> - And if they do have Linux people on staff they may be novice-level
> people or have zero experience with HPC schedulers, MPI fabric tweaking
> and app needs (the domain stuff)
> 
> - And if miracles occur and they do have expert level linux people then
> more often than not these people are overworked or stretched in many
> directions
> 
> 
> So what happens in these environments is that organizations will
> willingly (and happily) pay commercial pricing and adopt closed-source
> products if they can deliver a measurable reduction in administrative
> burden, operational effort or support burden.
> 
> This is where Bright, Univa etc. all come in -- you can buy stuff from
> them that dramatically reduces what onsite/local IT has to manage in
> terms of care and feeding.
> 
> Just having a vendor to call for support on Grid Engine oddities makes
> the cost of Univa licensing worthwhile. Just having a vendor like Bright
> be on the hook for "cluster operations" is a huge win for an overworked
> IT staff that does not have linux or HPC specialists on-staff or easily
> available.
> 
> My best example of "paying to reduce operational burden in HPC" comes
> from a massive well-known genome shop in the Cambridge, MA area. They
> often tell this story:
> 
> - 300 TB of new data generation per week (many years ago)
> - One of the initial storage tiers was ZFS running on commodity server
> hardware
> - Keeping the DIY ZFS appliances online and running took the FULL TIME
> efforts of FIVE STORAGE ENGINEERS
> 
> They realized that staff support was not scalable with DIY/ZFS at
> 300TB/week of new data generation so they went out and bought a giant
> EMC Isilon scale-out NAS platform
> 
> And you know what? After the Isilon NAS was deployed the management of
> *many* petabytes of single-namespace storage was now handled by the IT
> Director in his 'spare time' -- And the five engineers who used to do
> nothing but keep ZFS from falling over were re-assigned to more
> impactful and presumably more fun/interesting work.
> 
> 
> They actually went on stage at several conferences and told the story of
> how Isilon allowed senior IT leadership to manage petabyte volumes of
> data "in their spare time" -- this was a huge deal and really resonated
> . Really reinforced for me how in some cases it's actually a good idea
> to pay $$$ for commercial stuff if it delivers gains in
> ops/support/management.
> 
> 
> Sorry to digress! This is a topic near and dear to me. I often have to
> do HPC work in commercial environments where the skills simply don't
> exist onsite. Or more commonly -- they have budget to buy software or
> hardware but they are under a hiring freeze and are not allowed to bring
> in new Humans.
> 
> Quite a bit of my work on projects like this is helping people make
> sober decisions regarding "build" or "buy" -- and in those environments
> it's totally clear that for some things it makes sense for them to pay
> for an expensive commercially supported "thing" that they don't have to
> manage or support themselves.
> 
> 
> My $.02 ...
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


