[Beowulf] hpl size problems

Mark Hahn hahn at physics.mcmaster.ca
Mon Sep 26 12:20:00 PDT 2005


> Warewulf by default creates the virtual node file system to be extremely
> minimal yet fully functional and tuned for the job at hand (which exists
> in a hybrid RAM/NFS file system).

but HPL does very little IO and runs few commands.


> The nodes are lightweight in both file
> system and process load (context switching and cache management can be
> expensive, especially on non-NUMA SMP systems with lots of cache). The
> more daemons and extra processes that are running, the higher the
> process load and context switching that must occur.

it's hard to guess since we don't know what you were running before.
the only way I can imagine this (random procs) mattering is if you
were running a full desktop install before, and had some polling daemons
running (magicdev, artsd, etc.).
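
if you want to see what a node is actually carrying, something like
this works (RHEL-flavored commands; adjust for your distro):

    # services enabled at boot in the current runlevels
    chkconfig --list | grep ':on'
    # everything in userspace; kernel threads show up bracketed
    ps ax | grep -v '\['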

on my favorite cluster, I use the obvious kind of initrd+tmpfs+NFS
and don't run any extra daemons.  on a randomly chosen node running 
two MPI workers (out of 64 in the job), "vmstat 10" looks like this:

[hahn@node1 hahn]$ ssh node70 vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0 2106876  58240  11504 1309148    0    0     0     0 1035    54 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1037    59 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1034    55 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1033    56 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1034    44 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1031    41 99  1  0  0
 2  0 2106876  58312  11504 1309148    0    0     0     0 1033    39 99  1  0  0

I haven't updated the kernel to a lower HZ yet, but will soon.  I assert 
without the faintest wisp of proof that 50 cs/sec is inconsequential.
the gigabit on these nodes is certainly not sterile either - plenty of NFS 
traffic, even some NTP broadcasts.  actually, I just tcpdumped it a bit,
and the basal net rate is one ARP and four-ish NFS access/getattr calls 
every 60 seconds.
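
(nothing fancy behind that measurement - roughly this, where eth0 is 
an assumption about the interface name:

    # HZ is a compile-time choice; recent 2.6 kernels expose it here
    grep CONFIG_HZ /boot/config-$(uname -r)
    # watch a minute or so of background traffic on the node
    tcpdump -i eth0 -n 'not port 22'

the 'not port 22' filter just keeps the ssh session from watching 
itself.)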


> It reminds me of chapter 1 of sysadmin 101: Only install what you *need*

sure, but that's not inherent to your system, and unless you had some pretty
god-awful stuff installed before, it's hard to see that as the explanation...


> If someone else also has thoughts as to what would have caused the
> speedup, I would be very interested.

a full-fledged desktop install doesn't cause *that* much extraneous load - 
yes, there are interrupts and the like, but you have to remember that 
modern machines have massive memory bandwidth and big, associative caches,
so such stuff doesn't matter much.

especially for HPL - it's not exactly tightly-coupled, is it?  if it were
(i.e., MANY global collectives per second), then I could easily buy the 
explanation that removing random daemons would help a lot.  after all,
this effect has been known for a long time (though generally only on very 
large clusters).
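
the back-of-envelope version: if each node independently eats a stray 
timeslice during a given collective with probability p, the collective 
stalls with probability 1 - (1-p)^N.  picking an illustrative p = 0.001, 
that's about 6% for N=64 nodes but about 98% for N=4096 - which is why 
the OS-noise results come from the very-large-machine crowd.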


> > > hours) running on Centos-3.5 and saw a pretty amazing speedup of the
> > > scientific code (*over* 30% faster runtimes) than with the previous
> > > RedHat/Rocks build. Warewulf also makes the cluster rather trivial to
> > 
> > such a speedup is indeed impressive; what changed?
> 
> Actually, we used the same kernel (recompiled from RHEL), and exactly the
> same compilers, mpi and IB (literally the same RPMS). The only thing
> that changed was the cluster management paradigm. The tests were done
> back to back with no hardware changes.

afaik, recompiling a distro kernel generally does not get you 
the same binary as what the distro distributes ;)
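
(easy to check, too - something along these lines, with placeholder 
kernel version strings:

    # compare the distro's shipped config against the rebuild's
    diff /boot/config-2.4.21-32.EL /boot/config-2.4.21-custom
    # build stamp and the gcc used to build the running kernel
    uname -v
    cat /proc/version

config deltas - HZ, preemption, debug options - are the usual suspects.)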

regards, mark hahn.



