[Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT]

Guy Coates guy.coates at gmail.com
Thu Sep 23 12:45:37 UTC 2021


Out of interest, how large are the compute jobs (memory, runtime etc)?  How
easy is it to get them to fit into a serverless environment?

Thanks,

Guy

On Tue, 21 Sept 2021 at 13:02, Tim Cutts <tjrc at sanger.ac.uk> wrote:

> I think that’s exactly the situation we’ve been in for a long time,
> especially in life sciences, and it’s becoming more entrenched.  My
> experience is that the average user of our scientific computing systems has
> become steadily less technically savvy over many years.
>
> The presence of the cloud makes that more acute, in particular because it
> makes it easy for the user to effectively throw more hardware at the
> problem, which reduces the incentive to make their code particularly fast
> or efficient.  Cost is the only brake on it, and in many cases I’m finding
> the PI doesn’t actually care about that.  They care that a result is being
> obtained (and it’s time to first result they care about, not time to
> complete all the analysis), and so they typically don’t have much time for
> those of us who are telling them they need to invest time up front in
> developing and optimising efficient code.
>
> And cost is not necessarily the brake I thought it was going to be
> anyway.  One recent project we’ve done on AWS has impressed me a great
> deal.  It’s not terribly CPU efficient, and would doubtless, with
> sufficient effort, run much more efficiently on premises.  But it’s
> extremely elastic in nature, and so a good fit for the cloud.  Once a
> week, the project has to completely re-analyse the 600,000+ COVID genomes
> we’ve sequenced so far, looking for new branches in the phylogenetic tree,
> and to complete that analysis inside 8 hours.   Initial attempts to naively
> convert the HPC implementation to run on AWS looked as though they were
> going to be very expensive (~$50k per weekly run).  But a fundamental
> reworking of the entire workflow to make it as cloud native as possible, by
> which I mean almost exclusively serverless, has succeeded beyond what I
> expected.  The total cost is <$5,000 a month, and because there is
> essentially no statically configured infrastructure at all, the security is
> fairly easy to be comfortable about.  And all of that was done with no
> detailed thinking about whether the actual algorithms running in the
> containers are at all optimised in a traditional HPC sense.  It’s just not
> needed for this particular piece of work.  Did it need software developers
> with hardcore knowledge of performance optimisation?  No.  Was it rapid to
> develop and deploy?  Yes.  Is the performance fast enough for UK national
> COVID variant surveillance?  Yes.  Is it cost effective?  Yes.  Sold!  The
> one thing it did need was knowledgeable cloud architects, but the cloud
> providers can and do help with that.
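A minimal sketch of the "almost exclusively serverless" shape described above
- purely illustrative, not the actual pipeline; the bucket name, function
names, batch size and the EventBridge-style weekly trigger are all invented
for the example - would be a scheduled fan-out function with no long-lived
cluster behind it:

    # Hypothetical weekly fan-out Lambda (Python/boto3); all names are illustrative.
    import json
    import os

    import boto3

    s3 = boto3.client("s3")
    lam = boto3.client("lambda")

    GENOME_BUCKET = os.environ.get("GENOME_BUCKET", "example-genome-bucket")
    WORKER_FUNCTION = os.environ.get("WORKER_FUNCTION", "example-analyse-batch")
    BATCH_SIZE = 500  # genomes per worker invocation; arbitrary choice

    def handler(event, context):
        # Triggered once a week (e.g. by a schedule rule).  List the consensus
        # genomes and asynchronously invoke one worker per batch, so no
        # statically configured compute sits idle between runs.
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=GENOME_BUCKET, Prefix="consensus/"):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))

        for i in range(0, len(keys), BATCH_SIZE):
            lam.invoke(
                FunctionName=WORKER_FUNCTION,
                InvocationType="Event",  # fire and forget; workers scale out
                Payload=json.dumps({"keys": keys[i:i + BATCH_SIZE]}),
            )
        return {"genomes": len(keys)}

Nothing in that layer cares how efficient the analysis inside each worker is;
the elasticity and the absence of standing infrastructure are what keep the
cost and the security surface down.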
>
> Tim
>
> --
> Tim Cutts
> Head of Scientific Computing
> Wellcome Sanger Institute
>
>
> On 21 Sep 2021, at 12:24, John Hearns <hearnsj at gmail.com> wrote:
>
> Some points well made here. I have seen in the past job scripts passed on
> from graduate student to graduate student - the case I am thinking of was
> an Abaqus script for 8-core systems being run on a new 32-core system. Why
> would a graduate student question a script given to them which works?
> They should be getting on with their science. I guess this is where
> Research Software Engineers come in.
>
> Another point I would make is about modern processor architectures, for
> instance AMD Rome/Milan. You can have different NUMA Per Socket options,
> which affect performance. We set the preferred I/O path - which I have seen
> for myself to have an effect on the latency of MPI messages. If you are not
> concerned about your hardware layout you will just go ahead and run,
> missing a lot of performance.
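For what it's worth, the preferred-I/O-path point above is easy to check on
Linux without any special tools - a minimal sketch, assuming a Linux box,
with "eth0" standing in for whatever interface the MPI traffic actually runs
over:

    # Which NUMA node is the NIC/HCA attached to, and which cores are local to it?
    from pathlib import Path

    IFACE = "eth0"  # hypothetical; substitute your fabric interface

    node = Path(f"/sys/class/net/{IFACE}/device/numa_node").read_text().strip()
    if node == "-1":
        print(f"{IFACE}: no NUMA affinity reported (single socket or unknown)")
    else:
        cpus = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
        print(f"{IFACE} is attached to NUMA node {node}; local cores: {cpus}")

Pinning the ranks that drive the interconnect onto those local cores keeps
the I/O path on one socket, which is the effect on MPI latency mentioned
above.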
>
> I am now going to be controversial and comment that over in Julia land the
> pattern these days seems to be that people develop on their own laptops, or
> maybe local GPU systems. There is a lot of microbenchmarking going on, but
> there seems to be little thought given to CPU pinning or what happens
> with hyperthreading. I guess topics like that are part of HPC 'black magic'
> - though I would imagine the low-latency crowd are hot on them.
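Explicit pinning is not much code either - a minimal Linux-only sketch (the
core number here is arbitrary; a real job would take it from the scheduler or
from hwloc):

    import os
    from pathlib import Path

    def pin_to_core(core: int) -> None:
        # Restrict the calling process to a single logical CPU.
        os.sched_setaffinity(0, {core})

    def hyperthread_siblings(core: int) -> str:
        # Logical CPUs sharing the same physical core; putting two busy
        # compute threads there is what laptop microbenchmarks tend to miss.
        path = Path(f"/sys/devices/system/cpu/cpu{core}/topology/thread_siblings_list")
        return path.read_text().strip()

    if __name__ == "__main__":
        pin_to_core(2)
        print("running on:", os.sched_getaffinity(0))
        print("siblings of core 2:", hyperthread_siblings(2))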
>
> I often introduce people to the excellent lstopo/hwloc utilities, which
> show the layout of a system. Most people are pleasantly surprised by what
> they find.
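For anyone who has not met them, a typical invocation (assuming hwloc is
installed; wrapped in Python here only so it can sit in a job prolog) looks
like:

    # Record the machine layout alongside a run: packages, NUMA nodes,
    # caches, cores and PUs, without the I/O devices.
    import subprocess

    subprocess.run(["lstopo-no-graphics", "--no-io"], check=True)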
>
>
>


-- 
Dr. Guy Coates
+44(0)7801 710224