[Beowulf] Bright Cluster Manager
John Hearns
hearnsj at googlemail.com
Thu May 3 00:45:52 PDT 2018
Regarding storage, Chris Dagdigian comments:
> And you know what? After the Isilon NAS was deployed the management of
> *many* petabytes of single-namespace storage was now handled by the IT
> Director in his 'spare time' -- And the five engineers who used to do
> nothing but keep ZFS from falling over were re-assigned to more
> impactful and presumably more fun/interesting work.
The person who runs the huge JASMIN climate research project in the UK
makes the same comment, only with Panasas storage.
He is able to manage petabytes of Panasas storage with himself and one
other person. A lot of that storage was installed by my fair hands.
To be honest though, installing Panasas is a matter of how fast you can
unbox the blades (*)
(*) Well, that is not so in real life! During that install we had several
'funnies', all of which were diagnosed and fixed by the superb Panasas
support.
That included the shelf where, after replacing every component over the
period of two weeks - something like Trigger's Broom
http://foolsandhorses.weebly.com/triggers-broom.html
- we at last found the bent pin in the multiway connector (ahem).
On 3 May 2018 at 09:23, John Hearns <hearnsj at googlemail.com> wrote:
> Jorg, I did not know that you used Bright. Or I may have forgotten!
> I thought you were a Debian fan. Of relevance, Bright 8 now supports
> Debian.
>
> You commented on the Slurm configuration file being changed.
> I found during the install at Greenwich, where we put in a custom
> slurm.conf, that Bright has an option
> to 'freeze' files. This is defined in the cmd.conf file. So if new nodes
> are added, or other changes made,
> the slurm.conf file is left unchanged and you have to manage it manually.
> I am not 100% sure what happens with an update of the RPMs, but I would
> imagine the freeze state is respected.
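> Something like this in cmd.conf sets it up, if I remember the syntax right -
> treat it as a sketch, since the exact slurm.conf path depends on the Bright
> version and on how Slurm was deployed:
>
>   # tell CMDaemon to stop regenerating this file when nodes or settings change
>   FrozenFile = { "/cm/shared/apps/slurm/var/etc/slurm.conf" }
>
> Restart the cmd service after editing cmd.conf, and from then on the file is
> yours to manage by hand.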
>
>
> >I should add I am working in academia and I know little about the
> >commercial world here. Having said that, my friends in commerce are telling
> >me that the company likes to outsource as it is 'cheaper'.
> I would not say cheaper. However (see below) HPC skills are scarce.
> And if you are in industry you commit to your management that HPC
> resources will be up and running
> for XX % of a year - ie you have some explaining to do if there is
> extended downtime.
> HPC is looked upon as something comparable to machine tools - in Formula 1
> we competed for budget against
> five-axis milling machines for instance. Can you imagine what would happen
> if the machine shop supervisor said
> "Sorry - no parts being made today. My guys have the covers off and we are
> replacing one of the motors with one we got off Ebay"
>
>
> So yes you do want commercial support for aspects of your setup - let us
> say that jobs are going into hold states
> on your batch system, or jobs are immediately terminating. Do you:
>
> a) spend all day going through logs with a fine tooth comb, and send out
> an email to the Slurm/PBS/SGE list and hope you get
> some sort of answer
>
> b) take a dump of the relevant logs and get a ticket opened with your
> support people
>
> Actually in real life you do both, but path (b) is going to get you up and
> running quicker.
>
> Also for storage, in industry you really want support on your storage.
>
> >Anyhow, I don't want to digress here too much. However, "..do HPC work in
> >commercial environments where the skills simply don't exist onsite."
> >Are we a dying art?
>
> Jorg, yes. HPC skills are rare, as are the people who take the time and
> trouble to learn deeply about the systems they operate.
> I know this as recruitment consultants tell me this regularly.
> I find that often in life people do the minimum they need, and once they
> are given instructions they never change,
> even when the configuration steps they carry out have lost meaning.
> I have met that attitude in several companies. Echoing Richard Feynman, I
> call this 'cargo cult systems'.
> People like you, who are willing to continually learn and to abandon old
> ways of working, are invaluable.
>
> I am consulting at the moment with a biotech firm in Denmark. Replying to
> Chris Dagdigian, this company does have excellent in-house
> Linux skills, so I suppose it is the exception to the rule!
>
> On 2 May 2018 at 23:04, Jörg Saßmannshausen
> <sassy-work at sassy.formativ.net> wrote:
>
>> Dear Chris,
>>
>> further to your email:
>>
>> > - And if miracles occur and they do have expert level linux people then
>> > more often than not these people are overworked or stretched in many
>> > directions
>>
>> This is exactly what has happened to me at the old work place: pulled
>> into too
>> many different directions.
>>
>> I am a bit surprised about the ZFS experiences. Although I did not have
>> petabytes of storage and I did not generate 300 TB per week, I did have a
>> fairly large storage space running on xfs and ext4 for backups and
>> provisioning of file space. Some of it was running on old hardware
>> (please sit
>> down, I am talking about me messing around with SCSI cables) and I
>> gradually
>> upgraded it to newer hardware. So, I am not quite sure what went wrong with the
>> ZFS
>> storage here.
>>
>> However, there is a common trend, at least from what I observe here in the
>> UK, to out-source problems: pass the buck to somebody else and we pay for it.
>> I am personally still more in favour of an in-house expert than an
>> out-sourced person who may or may not be able to understand what you are
>> doing.
>> I should add I am working in academia and I know little about the
>> commercial
>> world here. Having said that, my friends in commerce are telling me that
>> the
>> company likes to outsource as it is 'cheaper'.
>> I agree regarding the Linux expertise. I think I am one of the two Linux
>> admins in the present work place. The official line is: we do not support
>> Linux (but we teach it).
>>
>> Anyhow, I don't want to digress here too much. However, "..do HPC work in
>> commercial environments where the skills simply don't exist onsite."
>> Are we a dying art?
>>
>> My 1 shilling here from a still cold and dark London.
>>
>> Jörg
>>
>>
>>
>> On Wednesday, 2 May 2018 at 16:19:48 BST, Chris Dagdigian wrote:
>> > Jeff White wrote:
>> > > I never used Bright. Touched it and talked to a salesperson at a
>> > > conference but I wasn't impressed.
>> > >
>> > > Unpopular opinion: I don't see a point in using "cluster managers"
>> > > unless you have a very tiny cluster and zero Linux experience. These
>> > > are just Linux boxes with a couple applications (e.g. Slurm) running
>> > > on them. Nothing special. xcat/Warewulf/Scyld/Rocks just get in the
>> > > way more than they help IMO. They are mostly crappy wrappers around
>> > > free software (e.g. ISC's dhcpd) anyway. When they aren't it's
>> > > proprietary trash.
>> > >
>> > > I install CentOS nodes and use
>> > > Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and
>> > > software. This also means I'm not stuck with "node images" and can
>> > > instead build everything as plain old text files (read: write
>> > > SaltStack states), update them at will, and push changes any time. My
>> > > "base image" is CentOS and I need no "baby's first cluster" HPC
>> > > software to install/PXEboot it. YMMV
>> >
>> > Totally legit opinion and probably not unpopular at all given the user
>> > mix on this list!
>> >
>> > The issue here is assuming a level of domain expertise with Linux,
>> > bare-metal provisioning, DevOps and (most importantly) HPC-specific
>> > configStuff that may be pervasive or easily available in your
>> > environment but is often not easily available in a
>> > commercial/industrial environment where HPC or "scientific computing"
>> > is just another business area that a large central IT organization must
>> > support.
>> >
>> > If you have that level of expertise available then the self-managed DIY
>> > method is best. It's also my preference
>> >
>> > But in the commercial world where HPC is becoming more and more
>> > important you run into stuff like:
>> >
>> > - Central IT may not actually have anyone on staff who knows Linux (more
>> > common than you expect; I see this in Pharma/Biotech all the time)
>> >
>> > - The HPC user base is not given budget or resource to self-support
>> > their own stack because of a drive to centralize IT ops and support
>> >
>> > - And if they do have Linux people on staff they may be novice-level
>> > people or have zero experience with HPC schedulers, MPI fabric tweaking
>> > and app needs (the domain stuff)
>> >
>> > - And if miracles occur and they do have expert level linux people then
>> > more often than not these people are overworked or stretched in many
>> > directions
>> >
>> >
>> > So what happens in these environments is that organizations will
>> > willingly (and happily) pay commercial pricing and adopt closed-source
>> > products if they can deliver a measurable reduction in administrative
>> > burden, operational effort or support burden.
>> >
>> > This is where Bright, Univa etc. all come in -- you can buy stuff from
>> > them that dramatically reduces what onsite/local IT has to manage the
>> > care and feeding of.
>> >
>> > Just having a vendor to call for support on Grid Engine oddities makes
>> > the cost of Univa licensing worthwhile. Just having a vendor like Bright
>> > be on the hook for "cluster operations" is a huge win for an overworked
>> > IT staff that does not have linux or HPC specialists on-staff or easily
>> > available.
>> >
>> > My best example of "paying to reduce operational burden in HPC" comes
>> > from a massive, well-known genome shop in the Cambridge, MA area. They
>> > often tell this story:
>> >
>> > - 300 TB of new data generation per week (many years ago)
>> > - One of the initial storage tiers was ZFS running on commodity server
>> > hardware
>> > - Keeping the DIY ZFS appliances online and running took the FULL TIME
>> > efforts of FIVE STORAGE ENGINEERS
>> >
>> > They realized that staff support was not scalable with DIY/ZFS at
>> > 300TB/week of new data generation so they went out and bought a giant
>> > EMC Isilon scale-out NAS platform
>> >
>> > And you know what? After the Isilon NAS was deployed the management of
>> > *many* petabytes of single-namespace storage was now handled by the IT
>> > Director in his 'spare time' -- And the five engineers who used to do
>> > nothing but keep ZFS from falling over were re-assigned to more
>> > impactful and presumably more fun/interesting work.
>> >
>> >
>> > They actually went on stage at several conferences and told the story of
>> > how Isilon allowed senior IT leadership to manage petabyte volumes of
>> > data "in their spare time" -- this was a huge deal and really resonated
>> > . Really reinforced for me how in some cases it's actually a good idea
>> > to pay $$$ for commercial stuff if it delivers gains in
>> > ops/support/management.
>> >
>> >
>> > Sorry to digress! This is a topic near and dear to me. I often have to
>> > do HPC work in commercial environments where the skills simply don't
>> > exist onsite. Or more commonly -- they have budget to buy software or
>> > hardware but they are under a hiring freeze and are not allowed to bring
>> > in new Humans.
>> >
>> > Quite a bit of my work on projects like this is helping people make
>> > sober decisions regarding "build" or "buy" -- and in those environments
>> > it's totally clear that for some things it makes sense for them to pay
>> > for an expensive commercially supported "thing" that they don't have to
>> > manage or support themselves
>> >
>> >
>> > My $.02 ...
>> >