[Beowulf] hpl size problems

Greg M. Kurtzer gmkurtzer at lbl.gov
Mon Sep 26 13:58:58 PDT 2005


On Mon, Sep 26, 2005 at 03:20:00PM -0400, Mark Hahn wrote:
> > Warewulf by default creates the virtual node file system to be extremely
> > minimal yet fully functional and tuned for the job at hand (which exists
> > in a hybrid RAM/NFS file system).
> 
> but HPL does very little IO and runs few commands.

My main point was just that the systems are lightweight. ;)

> > The nodes are lightweight in both file
> > system and process load (context switching and cache management can be
> > expensive, especially on non-NUMA SMP systems with lots of cache). The
> > more daemons and extra processes that are running, the higher the
> > process load and context switching that must occur.
> 
> it's hard to guess since we don't know what you were running before.
> the only way I can imagine this (random procs) mattering is if you
> were running a full desktop install before, and had some polling daemons
> running.  (magicdev, artsd, etc).

The previous install was Platform Rocks. Honestly, I did not examine the
implementation very carefully, except to note that the nodes were rather
heavy. Once things were running smoothly and we could verify that the
hardware was working properly, the cluster was immediately reinstalled.
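
For what it's worth, a rough way to quantify "heavy" is simply to count
what is running and what is enabled to start on a compute node. A quick
sketch (node70 is just an example hostname, and the chkconfig part
assumes a RHEL-style init):

    # count all processes on the node (kernel threads included)
    ssh node70 'ps -e --no-headers | wc -l'

    # list services enabled in at least one runlevel
    ssh node70 'chkconfig --list | grep ":on"'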

> on my favorite cluster, I use the obvious kind of initrd+tmpfs+NFS
> and don't run any extra daemons.  on a randomly chosen node running 
> two MPI workers (out of 64 in the job), "vmstat 10" looks like this:
> 
> [hahn at node1 hahn]$ ssh node70 vmstat 10
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  2  0 2106876  58240  11504 1309148    0    0     0     0 1035    54 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1037    59 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1034    55 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1033    56 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1034    44 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1031    41 99  1  0  0
>  2  0 2106876  58312  11504 1309148    0    0     0     0 1033    39 99  1  0  0
> 
> I haven't updated the kernel to a lower HZ yet, but will soon.  I assert 
> without the faintest wisp of proof that 50 cs/sec is inconsequential.
> the gigabit on these nodes is certainly not sterile either - plenty of NFS 
> traffic, even some NTP broadcasts.  actually, I just tcpdumped it a bit,
> and the basal net rate is an ARP plus 4-ish NFS access/getattr calls every 60 seconds.
> 
> 
> > It reminds me of chapter 1 of sysadmin 101: Only install what you *need*
> 
> sure, but that's not inherent to your system, and unless you had some pretty
> god-awful stuff installed before, it's hard to see that explanation...

Right you are, so let's call it a feature of the implementation, then.

> > If someone else also has thoughts as to what would have caused the
> > speedup, I would be very interested.
> 
> a full-fledged desktop load doesn't cause *that* much extraneous load - 
> yes, there are interrupts and the like, but you have to remember that 
> modern machines have massive memory bandwidth, big, associative caches,
> and such stuff doesn't matter much.

I was thinking that the increased context switching from the extra
processes would also increase the frequency at which the HPL processes
bounce between CPUs (there is no CPU/memory affinity set). Add to that
the time it takes to repopulate the 2 MB of L2 cache after each
migration.
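
One way to take that variable out of the picture would be to pin each
rank to a CPU so it never migrates and never has to refill L2 from
scratch. A sketch using taskset (numactl would be the analogue on a
NUMA box; "xhpl" is just the usual name of the HPL binary, normally
wrapped by the mpirun launcher):

    # pin the first rank on the node to CPU 0
    taskset 0x1 ./xhpl

    # re-pin an already-running rank by PID (12345 is illustrative)
    taskset -p 0x2 12345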

> especially for HPL - it's not exactly tightly-coupled, is it?  if it were
> (ie, MANY global collectives per second), then I could easily buy the 
> explanation that removal of random daemons would help a lot.  after all,
> this has been known for a long time (though generally only on very large 
> clusters).

Right, the speedup was not as significant with HPL as it was with the
tightly coupled production code that this system is primarily used for.

Sorry, I may have been vague there when I referred to the 30% speedup.

> > Actually, we used the same kernel (recompiled from RHEL), and exactly the
> > same compilers, mpi and IB (literally the same RPMS). The only thing
> > that changed was the cluster management paradigm. The tests were done
> > back to back with no hardware changes.
> 
> afaik, recompiling a distro kernel generally does not get you 
> the same binary as what the distro distributes ;)

Yes and no... it depends on how you define "same". ;)

Recompiling the same SRPM with the same compiler, compiler options, and
headers *should* yield a binary that is as close to the original as
"legally" possible. Of course it is not checksum-identical, but for all
practical purposes it is as close as one can get. ;^)
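
Concretely, I mean something like this (a sketch; the SRPM name is just
an example of a RHEL kernel of that vintage, and the two file names in
the last line stand in for the shipped and rebuilt images):

    # rebuild the vendor kernel from its source package
    rpmbuild --rebuild kernel-2.6.9-11.EL.src.rpm

    # the two images will not be bit-identical, mostly because of
    # embedded build dates and hostnames, but the code is the same
    md5sum vmlinuz.shipped vmlinuz.rebuilt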
-- 
Greg Kurtzer
Berkeley Lab, Linux guy


