[Beowulf] Rant on why HPC isn't as easy as I'd like it to be.

Tue Sep 21 11:24:45 UTC 2021

Some points well made here. I have seen in the past job scripts passed on
from graduate student to graduate student - the case I am thinking of was
an Abaqus script written for 8 core systems, being run on a new 32 core
system. Why WOULD a graduate student question a script given to them, one
which works? They should be getting on with their science. I guess this is
where Research Software Engineers come in.
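
As an illustration of the inherited-script problem, here is a minimal Python
sketch of a job wrapper that asks how many cores it has been given instead of
hard-coding 8. It assumes a Linux node and, purely as an example, a Slurm-style
environment variable (SLURM_CPUS_PER_TASK); the function name usable_cores is
hypothetical, and other schedulers expose the allocation under different names.

```python
import os


def usable_cores(env=None, env_var="SLURM_CPUS_PER_TASK"):
    """Return the number of cores this job should use.

    env_var is an assumption: Slurm sets SLURM_CPUS_PER_TASK when
    --cpus-per-task is requested; substitute your scheduler's variable.
    """
    env = os.environ if env is None else env
    value = env.get(env_var, "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # Fall back to the CPUs this process is actually allowed to run on
    # (Linux only), which respects cgroup/affinity limits set by the scheduler.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Last resort: all CPUs the OS reports.
    return os.cpu_count() or 1
```

The result can then be passed to the solver's own core-count option rather than
a number copied from the previous student's script.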

Another point I would make is about modern processor architectures, for
instance AMD Rome/Milan. You can have different NUMA Per Socket (NPS)
options, which affect performance. We set the preferred IO path - which I
have seen myself to have an effect on latency of MPI messages. If you are
not concerned about your hardware layout you would just go ahead and run,
missing a lot of performance.
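
To see what the NPS setting actually produced on a node, the NUMA layout can be
read straight from Linux sysfs. The following is a rough sketch, assuming a
Linux system that exposes /sys/devices/system/node; the helper names
parse_cpulist and numa_nodes are my own, not part of any library.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only NUMA node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids, as reported by sysfs."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"NUMA node {node_id}: {len(cpus)} CPUs -> {cpus}")
```

Running this before and after changing the BIOS setting shows how many NUMA
nodes the operating system sees and which cores belong to each.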

I am now going to be controversial and comment that over in Julia land the
pattern seems to be these days people develop on their own laptops, or
maybe local GPU systems. There is a lot of microbenchmarking going on. But
there seems to be not a lot of thought given to CPU pinning or what happens
with hyperthreading. I guess topics like that are part of HPC 'Black Magic'
- though I would imagine the low latency crowd are hot on them.
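
For anyone who has not met pinning before, here is a small Python sketch of the
idea, assuming Linux: find which logical CPUs are hyperthread siblings of each
other, keep one per physical core, and pin the current process to those. The
function names are hypothetical; os.sched_getaffinity and os.sched_setaffinity
are the standard library calls, and the sibling lists come from sysfs.

```python
import os


def smt_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs sharing a physical core with `cpu` (Linux sysfs)."""
    path = os.path.join(sysfs_root, f"cpu{cpu}", "topology", "thread_siblings_list")
    with open(path) as f:
        text = f.read().strip()
    cpus = []
    for part in text.split(","):  # format is e.g. "0,64" or "0-1"
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def one_thread_per_core(cpus, siblings_of):
    """Keep the lowest-numbered logical CPU of each physical core."""
    chosen, seen = [], set()
    for cpu in sorted(cpus):
        if cpu in seen:
            continue  # a sibling of this CPU was already chosen
        chosen.append(cpu)
        seen.update(siblings_of(cpu))
    return chosen


def pin_to_physical_cores():
    """Pin the calling process to one hardware thread per core (Linux only)."""
    allowed = os.sched_getaffinity(0)
    target = one_thread_per_core(allowed, smt_siblings)
    os.sched_setaffinity(0, target)
    return target
```

Whether one thread per core is the right choice depends on the code; the point
is that it becomes a deliberate decision, which a microbenchmark on a laptop
rarely captures.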

I often introduce people to the excellent lstopo/hwloc utilities which show
the layout of a system. Most people are pleasantly surprised to find this.

On Mon, 20 Sept 2021 at 19:28, Lux, Jim (US 7140) via Beowulf <
beowulf at beowulf.org> wrote:

> The recent comments on compilers, caches, etc., are why HPC isn’t a bigger
> deal.  The infrastructure today is reminiscent of what I used in the 1970s
> on a big CDC or Burroughs or IBM machine, perhaps with a FPS box attached.
>
> I prepare a job, with some sort of job control structure, submit it to a
> batch queue, and get my results some time later.  Sure, I’m not dropping
> off a deck or tapes, and I’m not getting green-bar paper or a tape back,
> but really, it’s not much different – I drop a file and get files back
> either way.
>
>
>
> And just like back then, it’s up to me to figure out how best to arrange
> my code to run fastest (for me, wall clock time, but for others it might
> be CPU time or cost or something else).
>
>
>
> It would be nice if the compiler (or run-time or infrastructure) figured
> out the whole “what’s the arrangement of cores/nodes/scratch storage for
> this application on this particular cluster”.
>
> I also acknowledge that this is a “hard” problem and one that doesn’t have
> the commercial value of, say, serving the optimum ads to me when I read the
> newspaper on line.
>
>
> Yeah, it’s not that hard to call library routines for matrix operations,
> and to put my trust in the library writers – I trust them more than I trust
> me to find the fastest linear equation solver, fft, etc. – but so far, the
> next level of abstraction up – “how many cores/nodes” is still left to me,
> and that means doing instrumentation, figuring out what the results mean,
> etc.
>
>
>
>
>
> *From: *Beowulf <beowulf-bounces at beowulf.org> on behalf of "
> beowulf at beowulf.org" <beowulf at beowulf.org>
> *Reply-To: *Jim Lux <james.p.lux at jpl.nasa.gov>
> *Date: *Monday, September 20, 2021 at 10:42 AM
> *To: *Lawrence Stewart <stewart at serissa.com>, Jim Cownie <
> jcownie at gmail.com>
> *Cc: *Douglas Eadline <deadline at eadline.org>, "beowulf at beowulf.org" <
> beowulf at beowulf.org>
> *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters
>
>
>
>
>
>
>
> *From: *Beowulf <beowulf-bounces at beowulf.org> on behalf of Lawrence
> Stewart <stewart at serissa.com>
> *Date: *Monday, September 20, 2021 at 9:17 AM
> *To: *Jim Cownie <jcownie at gmail.com>
> *Cc: *Lawrence Stewart <stewart at serissa.com>, Douglas Eadline <
> deadline at eadline.org>, "beowulf at beowulf.org" <beowulf at beowulf.org>
> *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters
>
>
>
> Well said.  Expanding on this, caches work because of both temporal
> locality and spatial locality.  Spatial locality is addressed by having
> cache lines be substantially larger than a byte or word.  These days, 64
> bytes is pretty common.  Some prefetch schemes, like the L1D version that
> fetches the VA ^ 64 clearly affect spatial locality.  Streaming prefetch
> has an expanded notion of “spatial” I suppose!
>
> What puzzles me is why compilers seem not to have evolved much notion of
> cache management. It seems like something a smart compiler could do.
> Instead, it is left to Prof. Goto and the folks at ATLAS and BLIS to
> figure out how to rewrite algorithms for efficient cache behavior. To my
> limited knowledge, compilers don’t make much use of PREFETCH or any
> non-temporal loads and stores either. It seems to me that once the
> programmer helps with RESTRICT and so forth, then compilers could
> perfectly well dynamically move parts of arrays around to maximize cache
> use.
>
>
>
> -L
>
>
>
> I suspect that there’s enough variability among cache implementation and
> the wide variety of algorithms that might use it that writing a
> smart-enough compiler is “hard” and “expensive”.
>
>
>
> Leaving it to the library authors is probably the best “bang for the
> buck”.