[Beowulf] CPU shifts?? and time problems

Wed Sep 2 06:57:38 PDT 2009

On Wed, 2 Sep 2009, amjad ali wrote:

> Hi All,
> I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on ROCKS-5
> with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel
> on it.
> Please reply following:
>
> 1) When I run my serial code on the dual-core head node (or parallel code
> with -np 1); it gives results in about 2 minutes. What I observe is that
> "System Monitor" application show that some times CPU1 become busy 80+% and
> CPU2 around 10% busy. After some time CPU1 gets share around 10% busy while
> the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue
> till end. Why this is so? Does this busy-ness shifts/swaping harms
> performance/speed?

the kernel decides where to run processes based on demand.  if the machine
were otherwise idle, your process would stay on the same CPU.  depending on
the particular kernel release, the kernel uses various heuristics to decide
how much to "resist" moving the process among cpus.

the cost of moving among cpus depends entirely on how much your code depends
on the resources tied to one cpu or the other.  for instance, if your code 
has a very small memory footprint, moving will have only trivial cost.
if your process has a larger working set size, but fits in onchip cache,
it may be relatively expensive to move to a different processor in the 
system that doesn't share cache.  consider a 6M L3 in a 2-socket system,
for instance: the inter-socket bandwidth will be approximately memory speed,
which on a core2 system is something like 6 GB/s.  so migration will incur
about a 1ms overhead (possibly somewhat hidden by concurrency.)
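
to spell out that arithmetic: the estimate is just the amount of cached state divided by the bandwidth available to refill it. a minimal sketch, where the function name is made up for illustration and the inputs are the rough figures quoted above, not measurements:

```python
def migration_cost_seconds(working_set_bytes, bandwidth_bytes_per_s):
    """Rough upper bound on the time to refill a cache after a migration."""
    return working_set_bytes / bandwidth_bytes_per_s

# figures from the text: a 6M cache refilled at roughly 6 GB/s
cost = migration_cost_seconds(6 * 2**20, 6e9)
print(f"{cost * 1e3:.2f} ms")   # about 1 ms
```

this is a worst case: it assumes the whole cache is live working set and that nothing overlaps with the refill.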

in your case (if I have the processor spec right), you have 2 cores
sharing a single 4M L2.  L1 cache is unshared, but trivial in size,
so migration cost should be considered near-zero.

the numactl command lets you bind a process to a cpu if you wish.
this is normally valuable on systems with more complex topologies,
such as combinations of shared and unshared caches, especially when 
divided over multiple sockets, and with NUMA memory (such as opterons 
and nehalems.)
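
if you want to experiment with pinning from inside a program rather than from the command line, the Linux scheduler-affinity call is exposed in Python's os module. this is a sketch of the cpu-binding idea only, not of numactl itself (it does nothing about memory placement), and the helper name is hypothetical:

```python
import os

def pin_to_one_cpu(pid=0):
    """Pin a process (0 = the caller) to one cpu it is already allowed on."""
    allowed = os.sched_getaffinity(pid)    # set of cpu ids the kernel may use
    target = min(allowed)                  # arbitrary choice: lowest-numbered cpu
    os.sched_setaffinity(pid, {target})    # the kernel will no longer migrate it
    return target

if hasattr(os, "sched_setaffinity"):       # these calls exist on Linux only
    original = os.sched_getaffinity(0)
    cpu = pin_to_one_cpu()
    print("pinned to cpu", cpu, "->", os.sched_getaffinity(0))
    os.sched_setaffinity(0, original)      # undo, so later code is unaffected
```

on a 2-core box with a shared L2 this is unlikely to change the runtime measurably, for the reasons given above.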

> 2)  When I run my parallel code with -np 2 on the dual-core headnode only;
> it gives results in about 1 minute. What I observe is that "System Monitor"
> application show that all the time CPU1 and CPU2 remain busy 100%.

no problem there.  normally, though, it's best to _not_ run extraneous 
processes, and instead only look at the elapsed time that the job takes 
to run.  that is the metric that you should care about.
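
measuring elapsed time needs nothing more than a wall clock around the job. a minimal sketch, assuming the job can be launched as a single command; the command shown is a placeholder, not your code:

```python
import subprocess
import sys
import time

def elapsed_seconds(cmd):
    """Run cmd to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)        # raises if the job exits non-zero
    return time.perf_counter() - start

# placeholder job; substitute the real launch line for your code
cmd = [sys.executable, "-c", "sum(range(10**6))"]
print(f"{elapsed_seconds(cmd):.2f} s")
```

run it a few times on an otherwise idle machine and compare the numbers, rather than watching a cpu graph.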

> 3)  When I run my parallel code with "-np 4" and "-np 8" on the dual-core
> headnode only; it gives results in about 2 and 3.20 minutes respectively.
> What I observe is that "System Monitor" application show that all the time
> CPU1 and CPU2 remain busy 100%.

sure.  with 4 processes, you're overloading the cpus, but they timeslice fairly
efficiently, so you don't lose.  once you get to 8 processes, you lose because
the overcommitted processes start interfering (probably their working set
is blowing the L2 cache.)

> 4)  When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8
> cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I

well, then I think it's a bit hyperbolic to call it a parallel code ;)
seriously, all you've learned here is that your code does not scale across
this interconnect.  the problem could be your code or the interconnect.

> observe is that "System Monitor" application show CPU usage fluctuations
> somewhat as in point number 1 above (CPU1 remains dominant busy most of the
> time), in case of -np 4. Does this means that an MPI-process is shifting to
> different cores/cpus/nodes? Does these shiftings harm performance/speed?

MPI does not shift anything.  the kernel may rebalance runnable processes 
within a single node, but not across nodes.  it's difficult to tell how much
your monitoring is harming the calculation or perturbing the load-balance.

> 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare to
> -np 2 on the headnode? Obviously its due to communication overhead! but how
> to get better performance--lesser run time? My code is not too complicated
> only 2 values are sent and 2 values are received by each process after each
> stage.

then do more work between sends and receives.  hard to say without knowing 
exactly what the communication pattern is.
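
a toy model shows why small messages can still dominate: compute time shrinks as you add processes, but the per-message latency paid at every stage does not. this is a sketch under loudly stated assumptions. only the 2-minute serial time and the 2 messages per stage come from your post; the stage count and the latency are invented placeholders, and the function name is hypothetical:

```python
def modeled_runtime(serial_s, nprocs, nsteps, msgs_per_step, latency_s):
    """Compute time divides by nprocs; per-message latency does not."""
    compute = serial_s / nprocs
    comm = 0.0 if nprocs == 1 else nsteps * msgs_per_step * latency_s
    return compute + comm

SERIAL_S = 120.0       # from the post: the serial run takes about 2 minutes
NSTEPS = 1_000_000     # ASSUMPTION: the number of stages is not given
LATENCY_S = 50e-6      # ASSUMPTION: rough per-message cost of MPI over TCP on Gb

for nprocs in (1, 2, 4, 8):
    t = modeled_runtime(SERIAL_S, nprocs, NSTEPS, msgs_per_step=2,
                        latency_s=LATENCY_S)
    print(nprocs, round(t, 1))
```

in this model the communication term is the same for any process count above 1, so the only ways to shrink it are fewer messages, cheaper messages, or more computation per message.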

I think you should first validate your cluster to see that the Gb is 
running as fast as expected.  actually, that everything is running right.
that said, Gb is almost not a cluster interconnect at all, since it's 
so much slower than the main competitors (IB mostly, to some extent 10GE).
fatter nodes (dual-socket quad-core, for instance) would at least decrease
the effect of slow interconnect.
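
one quick sanity check is a small-message ping-pong between two nodes. the sketch below does this over plain TCP with the standard library; it is not a substitute for a proper MPI benchmark, the function names are made up, and the demo at the bottom only runs over loopback (on a cluster the echo side would run on a different node):

```python
import socket
import threading
import time

def echo_once(server_sock):
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = server_sock.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

def mean_round_trip(host, port, rounds=1000, payload=b"x" * 16):
    """Average seconds per small-message round trip over TCP."""
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # no batching
        start = time.perf_counter()
        for _ in range(rounds):
            s.sendall(payload)
            got = 0
            while got < len(payload):      # TCP may split the reply
                chunk = s.recv(len(payload) - got)
                if not chunk:
                    raise ConnectionError("peer closed early")
                got += len(chunk)
        return (time.perf_counter() - start) / rounds

# loopback demo; port 0 asks the OS for any free port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
worker = threading.Thread(target=echo_once, args=(server,), daemon=True)
worker.start()
rtt = mean_round_trip("127.0.0.1", port, rounds=200)
worker.join(timeout=5)
server.close()
print(f"mean round trip: {rtt * 1e6:.0f} us")
```

multiply the measured round trip by the number of exchanges your code performs and you have a rough floor on the communication time.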

you might also try installing openMX, which is an ethernet protocol
optimized for MPI (rather than your current MPI which is presumably 
layered on top of the usual TCP stack, which is optimized for wide-area
streaming transfers.)  heck, you can probably obtain some speedup by
tweaking your interrupt coalescing settings via ethtool.