[Beowulf] CPU shifts?? and time problems

amjad ali amjad11 at gmail.com
Wed Sep 2 20:33:38 PDT 2009


Hi,
please see below

On Wed, Sep 2, 2009 at 6:57 PM, Mark Hahn <hahn at mcmaster.ca> wrote:

> On Wed, 2 Sep 2009, amjad ali wrote:
>
>> Hi All,
>> I have a 4-node (4 Xeon 3085 CPUs, 8 cores in total) Beowulf cluster running
>> ROCKS-5 with gigabit Ethernet. I have tested both serial and parallel runs
>> of a 1D CFD code on it.
>> Please reply to the following:
>>
>> 1) When I run my serial code on the dual-core head node (or the parallel
>> code with -np 1), it gives results in about 2 minutes. The "System Monitor"
>> application shows that sometimes CPU1 is 80+% busy while CPU2 is around 10%
>> busy; after some time CPU1 drops to around 10% busy while CPU2 becomes 80+%
>> busy. These fluctuations/swaps of busy-ness continue until the end. Why is
>> this so? Does this shifting/swapping of busy-ness harm performance/speed?
>>
>
> the kernel decides where to run processes based on demand.  if the machine
> were otherwise idle, your process would stay on the same CPU.  depending on
> the particular kernel release, the kernel uses various heuristics to decide
> how much to "resist" moving the process among cpus.
>
> the cost of moving among cpus depends entirely on how much your code
> depends
> on the resources tied to one cpu or the other.  for instance, if your code
> has a very small memory footprint, moving will have only trivial cost.
> if your process has a larger working set size, but fits in on-chip cache,
> it may be relatively expensive to move to a different processor in the
> system that doesn't share that cache.  consider a 6M L3 in a 2-socket system,
> for instance: the inter-socket bandwidth will be approximately memory speed,
> which on a core2 system is something like 6 GB/s.  so migration will incur
> about a 1ms overhead (possibly somewhat hidden by concurrency.)
>
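A quick back-of-the-envelope check of that figure, as I understand it: refilling
a ~6 MB working set over a ~6 GB/s path takes roughly 6e6 bytes / 6e9 bytes/s =
1 ms, which is where the 1 ms migration estimate above comes from.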
> in your case (if I have the processor spec right), you have 2 cores
> sharing a single 4M L2.  L1 cache is unshared, but trivial in size,
> so migration cost should be considered near-zero.
>
> the numactl command lets you bind a process to a cpu if you wish.
> this is normally valuable on systems with more complex topologies,
> such as combinations of shared and unshared caches, especially when divided
> over multiple sockets, and with NUMA memory (such as opterons and nehalems.)
>
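For the record, I believe the pinning itself can be tried from the shell with
something like the following (illustrative commands only; "./mycode" is just a
placeholder for the executable, and exact option spellings depend on the
taskset/numactl versions that ROCKS ships):

    taskset -c 0 ./mycode              # bind the process to core 0
    numactl --physcpubind=0 ./mycode   # the same idea via numactl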
>> 2) When I run my parallel code with -np 2 on the dual-core head node only,
>> it gives results in about 1 minute. The "System Monitor" application shows
>> that CPU1 and CPU2 remain 100% busy the whole time.
>>
>
> no problem there.  normally, though, it's best to _not_ run extraneous
> processes, and instead only look at the elapsed time that the job takes to
> run.  that is the metric that you should care about.
>
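Point taken about elapsed time. Here is a minimal sketch of how I plan to time
the runs with MPI_WTIME instead of watching System Monitor (assuming the usual
USE mpi / mpif.h setup, and that myrank and ierr are already declared; the
variable names are mine):

    REAL(8) :: t_start, t_end, t_local, t_max

    CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
    t_start = MPI_WTIME()
    ! ... the solver's time-stepping loop goes here ...
    t_end   = MPI_WTIME()

    ! report the slowest rank's elapsed time, which is the number that matters
    t_local = t_end - t_start
    CALL MPI_REDUCE(t_local, t_max, 1, MPI_REAL8, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
    IF (myrank == 0) PRINT *, 'elapsed wall time (s): ', t_max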
>> 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core
>> head node only, it gives results in about 2 and 3.20 minutes respectively.
>> The "System Monitor" application shows that CPU1 and CPU2 remain 100% busy
>> the whole time.
>>
>
> sure.  with 4 processes you're overloading the 2 cpus, but they timeslice
> fairly efficiently, so you don't lose.  once you get to 8 processes, you lose
> because the overcommitted processes start interfering (probably their combined
> working set is blowing the L2 cache.)
>
>> 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node
>> (8-core) cluster, it gives results in about 9 (NINE) and 12 minutes. What I
>>
>
> well, then I think it's a bit hyperbolic to call it a parallel code ;)
> seriously, all you've learned here is that your code does not scale on this
> cluster.  the problem could be your code or the interconnect.
>
>> observe is that the "System Monitor" application shows CPU usage fluctuations
>> somewhat as in point 1 above (CPU1 remains the dominantly busy one most of
>> the time) in the case of -np 4. Does this mean that an MPI process is
>> shifting to different cores/cpus/nodes? Do these shifts harm
>> performance/speed?
>>
>
> MPI does not shift anything.  the kernel may rebalance runnable processes
> within a single node, but not across nodes.  it's difficult to tell how much
> your monitoring is harming the calculation or perturbing the load-balance.
>
>> 5) Why do "-np 4" and "-np 8" on the cluster take so much more time than
>> -np 2 on the head node? Obviously it is due to communication overhead, but
>> how do I get better performance (a lower run time)? My code is not too
>> complicated: only 2 values are sent and 2 values are received by each
>> process after each stage.
>>
>
> then do more work between sends and receives.  hard to say without knowing
> exactly what the communication pattern is.
>

Here is my subroutine:

    ! Halo exchange for u.  Requests are named after the direction the data
    ! travels: right(*) tracks the rightward-moving value (tags 55/551),
    ! left(*) the leftward-moving value (tags 66/661).
    IF (myrank /= p-1) CALL MPI_ISEND(u_local(Np,E), 1, MPI_REAL8, myrank+1, &
                                      55, MPI_COMM_WORLD, right(1), ierr)
    IF (myrank /= 0)   CALL MPI_ISEND(u_local(1,B),  1, MPI_REAL8, myrank-1, &
                                      66, MPI_COMM_WORLD, left(1),  ierr)

    IF (myrank /= 0)   CALL MPI_IRECV(u_left_exterior,  1, MPI_REAL8, myrank-1, &
                                      55, MPI_COMM_WORLD, right(2), ierr)
    IF (myrank /= p-1) CALL MPI_IRECV(u_right_exterior, 1, MPI_REAL8, myrank+1, &
                                      66, MPI_COMM_WORLD, left(2),  ierr)

    ! Local work that does not need the halo values: flatten the arrays and
    ! compute the interior-face jumps while the messages are in flight.
    u0_local  = RESHAPE(u_local, (/Np*K_local/))
    du0_local = RESHAPE(du_local,(/Nfp*Nfaces*K_local/))
    q_local   = rx_local*MATMUL(Dr,u_local)

    DO I = shift*Nfp*Nfaces+1+1, shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
        du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8
    ENDDO

    ! Left-most face of this rank's partition.
    I = shift*Nfp*Nfaces+1
    IF (myrank == 0) du0_local(I) = &
        (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8

    ! Wait for the u halo exchange to complete.
    IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
    IF (myrank /= 0)   CALL MPI_WAIT(left(1),  status, ierr)
    IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
    IF (myrank /= p-1) CALL MPI_WAIT(left(2),  status, ierr)

    ! Faces shared with the neighbouring ranks use the received halo values.
    IF (myrank /= 0) THEN
       du0_local(I) = (u0_local(vmapM_local(I))-u_left_exterior)/2.0_8
    ENDIF

    ! Right-most face of this rank's partition.
    I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
    IF (myrank == p-1) du0_local(I) = &
        (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8

    IF (myrank /= p-1) THEN
       du0_local(I) = (u0_local(vmapM_local(I))-u_right_exterior)/2.0_8
    ENDIF

    ! Physical boundary conditions on the global ends of the domain.
    IF (myrank == 0)   du0_local(mapI) = 0.0_8
    IF (myrank == p-1) du0_local(mapO) = 0.0_8

    du_local = RESHAPE(du0_local,(/Nfp*Nfaces,K_local/))

    q_local = q_local - MATMUL(LIFT, Fscale_local*(nx_local*du_local))

    ! Second halo exchange, this time for q (tags 551/661).
    IF (myrank /= p-1) CALL MPI_ISEND(q_local(Np,E), 1, MPI_REAL8, myrank+1, &
                                      551, MPI_COMM_WORLD, right(1), ierr)
    IF (myrank /= 0)   CALL MPI_ISEND(q_local(1,B),  1, MPI_REAL8, myrank-1, &
                                      661, MPI_COMM_WORLD, left(1),  ierr)

    IF (myrank /= 0)   CALL MPI_IRECV(q_left_exterior,  1, MPI_REAL8, myrank-1, &
                                      551, MPI_COMM_WORLD, right(2), ierr)
    IF (myrank /= p-1) CALL MPI_IRECV(q_right_exterior, 1, MPI_REAL8, myrank+1, &
                                      661, MPI_COMM_WORLD, left(2),  ierr)

    ! Again, overlap the interior work with the q exchange.
    q0_local  = RESHAPE(q_local, (/Np*K_local/))
    dq0_local = RESHAPE(dq_local,(/Nfp*Nfaces*K_local/))

    rhsu_local = rx_local*MATMUL(Dr,q_local) + &
                 u_local*(u_local-a1)*(1.0_8-u_local) - v_local

    DO I = shift*Nfp*Nfaces+1+1, shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
        dq0_local(I) = (q0_local(vmapM_local(I))-q0_local(vmapP_local(I)))/2.0_8
    ENDDO

    I = shift*Nfp*Nfaces+1

    ! Wait for the q halo exchange to complete.
    IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
    IF (myrank /= 0)   CALL MPI_WAIT(left(1),  status, ierr)
    IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
    IF (myrank /= p-1) CALL MPI_WAIT(left(2),  status, ierr)

    IF (myrank /= 0) THEN
        dq0_local(I) = (q0_local(vmapM_local(I))-q_left_exterior)/2.0_8
    ENDIF

    I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local

    IF (myrank /= p-1) THEN
        dq0_local(I) = (q0_local(vmapM_local(I))-q_right_exterior)/2.0_8
    ENDIF

    ! Physical boundary conditions for q on the global ends.
    IF (myrank == 0)   dq0_local(mapI) = q0_local(vmapI)+0.225_8
    IF (myrank == p-1) dq0_local(mapO) = q0_local(vmapO)

    dq_local = RESHAPE(dq0_local,(/Nfp*Nfaces,K_local/))

    rhsu_local = rhsu_local - MATMUL(LIFT, Fscale_local*(nx_local*dq_local))

END SUBROUTINE
=====================================================================

Is it sufficiently good, or are there serious problems with the communication
pattern? The arrays here are not very big, because Nfp*Nfaces = 2 and
K_local <= 30.
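As an aside, one simplification I am considering (only a sketch, and only a
tidy-up, not a performance fix): give the end ranks MPI_PROC_NULL neighbours,
so that communication with a non-existent neighbour becomes a no-op. That
removes the IF guards around the ISEND/IRECV/WAIT calls and lets one
MPI_WAITALL cover all four requests; the boundary-face IF blocks stay as they
are, so the numerics should be unchanged. Roughly:

    ! declared with the other locals:
    INTEGER :: nbr_left, nbr_right, reqs(4), stats(MPI_STATUS_SIZE,4)

    nbr_left  = myrank - 1
    nbr_right = myrank + 1
    IF (myrank == 0)   nbr_left  = MPI_PROC_NULL
    IF (myrank == p-1) nbr_right = MPI_PROC_NULL

    CALL MPI_ISEND(u_local(Np,E),    1, MPI_REAL8, nbr_right, 55, MPI_COMM_WORLD, reqs(1), ierr)
    CALL MPI_ISEND(u_local(1,B),     1, MPI_REAL8, nbr_left,  66, MPI_COMM_WORLD, reqs(2), ierr)
    CALL MPI_IRECV(u_left_exterior,  1, MPI_REAL8, nbr_left,  55, MPI_COMM_WORLD, reqs(3), ierr)
    CALL MPI_IRECV(u_right_exterior, 1, MPI_REAL8, nbr_right, 66, MPI_COMM_WORLD, reqs(4), ierr)

    ! ... interior-face work, overlapped with the exchange, as above ...

    CALL MPI_WAITALL(4, reqs, stats, ierr)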


>
> I think you should first validate your cluster to see that the Gb is
> running as fast as expected.  actually, that everything is running right.
> that said, Gb is almost not a cluster interconnect at all, since it's so
> much slower than the main competitors (IB mostly, to some extent 10GE).
> fatter nodes (dual-socket quad-core, for instance) would at least decrease
> the effect of slow interconnect.
>
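That sounds like the right first step. To check what the GigE actually
delivers, I will try a small ping-pong test along the following lines (my own
throwaway test program, not part of the solver; run with 2 processes placed on
two different nodes). My rough expectation for healthy gigabit with MPI over
TCP is on the order of 100 MB/s, but please correct me if that is off:

    PROGRAM pingpong
       ! bounce a 1 MB message between ranks 0 and 1 and report the bandwidth
       USE mpi                                        ! or: INCLUDE 'mpif.h'
       IMPLICIT NONE
       INTEGER, PARAMETER :: n = 131072, nreps = 50   ! 131072 * 8 bytes = 1 MB
       REAL(8) :: buf(n), t0, t1
       INTEGER :: myrank, ierr, i, stat(MPI_STATUS_SIZE)

       CALL MPI_INIT(ierr)
       CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
       buf = 1.0_8

       CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
       t0 = MPI_WTIME()
       DO i = 1, nreps
          IF (myrank == 0) THEN
             CALL MPI_SEND(buf, n, MPI_REAL8, 1, 1, MPI_COMM_WORLD, ierr)
             CALL MPI_RECV(buf, n, MPI_REAL8, 1, 2, MPI_COMM_WORLD, stat, ierr)
          ELSE IF (myrank == 1) THEN
             CALL MPI_RECV(buf, n, MPI_REAL8, 0, 1, MPI_COMM_WORLD, stat, ierr)
             CALL MPI_SEND(buf, n, MPI_REAL8, 0, 2, MPI_COMM_WORLD, ierr)
          END IF
       END DO
       t1 = MPI_WTIME()

       ! 2 messages of n*8 bytes per iteration
       IF (myrank == 0) PRINT *, 'bandwidth (MB/s): ', 2*nreps*n*8.0_8/(t1-t0)/1.0e6_8
       CALL MPI_FINALIZE(ierr)
    END PROGRAM pingpong

Reducing n to 1 and dividing the measured time by 2*nreps should also give a
rough one-way latency figure for small messages.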
> you might also try installing Open-MX, which is an ethernet protocol
> optimized for MPI (rather than your current MPI, which is presumably layered
> on top of the usual TCP stack, which is optimized for wide-area
> streaming transfers.)  heck, you can probably obtain some speedup by
> tweaking your interrupt-coalescing settings via ethtool.
>
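Thanks. For the coalescing idea, I understand the current settings can be
inspected and changed with ethtool, along these lines (illustrative only: eth0
is the assumed interface name, and whether the driver accepts or honours a
given value is something I would have to check and then measure):

    ethtool -c eth0              # show the current interrupt-coalescing settings
    ethtool -C eth0 rx-usecs 25  # example: lower the receive coalescing delay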