[Beowulf] Parallel Programming Question

Wed Jul 1 12:41:47 PDT 2009

Hi Amjad, list

amjad ali wrote:
> Hi,
> Gus--thank you.
> You are right. I mainly have to run programs on a small cluster (GiGE) 
> dedicated for my job only; and sometimes I might get some opportunity to 
> run my code on a shared cluster with hundreds of nodes.
> 

Thanks for telling.
My guess was not very far from true. :)

BTW, if you are doing cluster control, I/O and MPI all across the same
GigE network, and if your nodes have dual GigE ports, you may consider
buying a second switch (or using VLAN on your existing switch,
if it has it) to deploy a second GigE network only for MPI.
In case you haven't done this yet, of course.
The cost is very modest, and the performance should improve.
OpenMPI and MPICH can select which network they use, leaving the other 
one for I/O and control.

> My parallel CFD application involves (In its simplest form):
> 1) reading of input and mesh data from few files by the master process (I/O)
> 2) Sending the full/respective data to all other processes (MPI 
> Broadcast/Scatter)
> 3) Share the values at subdomains (created by Metis) boundaries at the 
> end of each iteration (MPI Message Passing)
> 4) On converge,  send/Gather the results from all processes to master 
> process
> 5) Writing the results to files by the master process (I/O).
> 
> So I think my program is not I/O intensive; so the Funneling I/O through 
> the master process is sufficient for me. Right?
> 

This scheme doesn't differ much from most
atmosphere/ocean/climate models we run here.
(They call the programs "models" in this community.)
After all, part of the computations here are also CFD-type,
although with reduced forms of the Navier-Stokes equation
in a rotating planet.
(Other computations are not CFD, e.g. radiative transfer.)

We tend to output data every 4000 time steps (order of magnitude),
and in extreme, rare, cases every 40 time steps or so.
There is a lot of computation on each time step, and the usual
exchange of boundary values across subdomains using MPI
(i.e. we also use domain decomposition).
You may have adaptive meshes though,
whereas most of our models use fixed grids.

For this pattern of work and this ratio of 
computation-to-communication-to-I/O,
the models that work best are those that funnel I/O through
the "master" processor.

My guess is that this scheme would work OK for you also,
since you seem to output data only "on convergence" (to a steady state 
perhaps?).
I presume this takes many time steps,
and involves a lot of computation, and a significant amount of MPI 
communication, right?

> But now I have to parallelize a new serial code, which plots the results 
> while running  (online/live display).  Means that it shows the  plots 
> of  three/four variables (in small windows) while running and we see it 
> as video (all progress from initial stage to final stage). I assume that 
> this time much more I/O is involved. At the end of each iteration result 
> needs to be gathered from all processes to the master process. And 
> possibly needs to be written in files as well (I am not sure). Do we 
> need to write it on some file/s for online display, after each 
> iteration/time-step?
> 

Do you somehow use the movie results to control the program,
change its parameters, or its course of action?
If you do, then the "real time" feature is really required.
Otherwise, you could process the movie offline after the run ends,
although this would spoil the fun of seeing it live, no doubt about it.

I suppose a separate program shows the movie, right?
Short from a more sophisticated solution using MPI-I/O and perhaps
relying on a parallel file system, you could dump the
subdomain snapshots to the nodes' local disks, then run
a separate program to harvest this data, recompose the frames into
the global domain, and exhibit the movie.
(Using MPI types may help rebuild the frames on the global domain.)
If the snapshots are not dumped very often, funneling them through the
master processor using MPI_Gather[v], and letting the master processor 
output the result, would be OK also.

Regardless of how you do it, I doubt you need one snapshot for each
time step. It should be much less, as you say below.

> I think (as serial code will be displaying result after each 
> iteration/time step), I should display result online after 100 
> iterations/time-steps in my parallel version so less "I/O" and/or 
> "funneling I/O through master process" will be required.
> Any opinion/suggestion?
> 

Yes, you may want to decimate the number of snapshots that you dump to
file, to avoid I/O at every time step.
How many time steps between snapshots?
It depends on how fast the algorithm moves the solution, I would guess.
It should be an interval short enough to provide smooth transitions from 
frame to frame, but long enough to avoid too much I/O.
You may want to leave this number as a (Fortran namelist) parameter
that you can choose at runtime.

Movies are 24 frames per second (at least before the digital era).

Jean-Luc Goddard once said:
"Photography is truth. Cinema is truth twenty-four times per second."

Of course he also said:
"Cinema is the most beautiful fraud in the world.".

But you don't need to tell your science buddies or your adviser about 
that ... :)

Good luck!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> regards,
> Amjad Ali.
> 
> 
> 
> On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa <gus at ldeo.columbia.edu 
> <mailto:gus at ldeo.columbia.edu>> wrote:
> 
>     Hi Bogdan, list
> 
>     Oh, well, this is definitely a peer reviewed list.
>     My answers were given in the context of Amjad's original
>     questions, and the perception, based on Amjad's previous
>     and current postings, that he is not dealing with a large cluster,
>     or with many users, and plans to both parallelize and update his
>     code from F77 to F90, which can be quite an undertaking.
>     Hence, he may want to follow the path of least resistance,
>     rather than aim at the fanciest programming paradigm.
> 
>     In the edited answer that context was stripped off,
>     and so was the description of
>     "brute force" I/O in parallel jobs.
>     That was the form of concurrent I/O I was referring to.
>     An I/O mode which doesn't take any precautions
>     to avoid file and network contention, and unfortunately is more common
>     than clean, well designed, parallel I/O code (at least on the field
>     I work).
> 
>     That was the form of concurrent I/O I was referring to
>     (all processors try to do I/O at the same time using standard
>     open/read/write/close commands provided by Fortran or another language,
>     not MPI calls).
> 
>     Bogdan seems to be talking about programs with well designed
>     parallel I/O instead.
> 
> 
>     Bogdan Costescu wrote:
> 
>         On Wed, 24 Jun 2009, Gus Correa wrote:
> 
>             the "master" processor reads... broadcasts parameters that
>             are used by all "slave" processors, and scatters any data
>             that will be processed in a distributed fashion by each
>             "slave" processor.
>             ...
>             That always works, there is no file system contention.
> 
> 
>         I beg to disagree. There is no file system contention if this
>         job is the only one doing the I/O at that time, which could be
>         the case if a job takes the whole cluster. However, in a more
>         conventional setup with several jobs running at the same time,
>         there is I/O done from several nodes (running the MPI rank 0 of
>         each job) at the same time, which will still look like mostly
>         random I/O to the storage.
> 
> 
>     Indeed, if there are 1000 jobs running,
>     even if each one is funneling I/O through
>     the "master" processor, there will be a large number of competing
>     requests to the I/O system, hence contention.
>     However, contention would also happen if all jobs were serial.
>     Hence, this is not a problem caused by or specific from parallel jobs.
>     It is an intrinsic limitation of the I/O system.
> 
>     Nevertheless, what if these 1000 jobs are running on the same cluster,
>     but doing "brute force" I/O through
>     each of their, say, 100 processes?
>     Wouldn't file and network contention be larger than if the jobs were
>     funneling I/O through a single processor?
> 
>     That is the context in which I made my statement.
>     Funneling I/O through a "master" processor reduces the chances of file
>     contention because it minimizes the number of processes doing I/O,
>     or not?
> 
> 
>             Another drawback is that you need to write more code for the
>             I/O procedure.
> 
> 
>         I also disagree here. The code doing I/O would need to only
>         happen on MPI rank 0, so no need to think for the other ranks
>         about race conditions, computing a rank-based position in the
>         file, etc.
> 
> 
>      >From what you wrote,
>     you seem to agree with me on this point, not disagree.
> 
>     1) Brute force I/O through all ranks takes little programming effort,
>     the code is basically the same serial,
>     and tends to trigger file contention (and often breaks NFS, etc).
>     2) Funneling I/O through the master node takes a moderate programming
>     effort.  One needs to gather/scatter data through the "master"
>     processor, which concentrates the I/O, and reduces contention.
>     3) Correct and cautious parallel I/O across all ranks takes a larger
>     programming effort,
>     due to the considerations you pointed out above.
> 
> 
>             In addition, MPI is in control of everything, you are less
>             dependent on NFS quirks.
> 
> 
>         ... or cluster design. I have seen several clusters which were
>         designed with 2 networks, a HPC one (Myrinet or Infiniband) and
>         GigE, where the HPC network had full bisection bandwidth, but
>         the GigE was a heavily over-subscribed one as the design really
>         thought only about MPI performance and not about I/O
>         performance. In such an environment, it's rather useless to try
>         to do I/O simultaneously from several nodes which share the same
>         uplink, independent whether the storage is a single NFS server
>         or a parallel FS. Doing I/O from only one node would allow full
>         utilization of the bandwidth on the chain of uplinks to the
>         file-server and the data could then be scattered/gathered fast
>         through the HPC network. Sure, a more hardware-aware application
>         could have been more efficient (f.e. if it would be possible to
>         describe the network over-subscription so that as many uplinks
>         could be used simultaneously as possible), but a more balanced
>         cluster design would have been even better...
> 
> 
>     Absolutely, but the emphasis I've seen, at least for small clusters
>     designed for scientific computations in a small department or
>     research group is to pay less attention to I/O that I had the chance
>     to know about.
>     When one gets to the design of the filesystems and I/O the budget is
>     already completely used up to buy a fast interconnect for MPI.
>     I/O is then done over Gigabit Ethernet using a single NFS
>     file server (often times a RAID on the head node itself).
>     For the scale of a small cluster, with a few tens of nodes or so,
>     this may work OK,
>     as long as one writes code that is gentle with NFS
>     (e.g. by funneling I/O through the head node).
> 
>     Obviously the large clusters on our national labs and computer centers
>     do take into consideration I/O requirements, parallel file systems,
>     etc. However, that is not my reality here, and I would guess it is
>     not Amjad's situation either.
> 
> 
>             [ parallel I/O programs ] always cause a problem when the
>             number of processors is big.
> 
> 
> 
>     Sorry, but I didn't say parallel I/O programs.
>     I said brute force I/O by all processors (using standard NFS,
>     no parallel file system, all processors banging on the file system
>     with no coordination).
> 
> 
>         I'd also like to disagree here. Parallel file systems teach us
>         that a scalable system is one where the operations are split
>         between several units that do the work. Applying the same
>         knowledge to the generation of the data, a scalable application
>         is one for which the I/O operations are done as much as possible
>         split between the ranks.
> 
> 
>     Yes.
>     If you have a parallel file system.
> 
> 
> 
>         IMHO, the "problem" that you see is actually caused by reaching
>         the limits of your cluster, IOW this is a local problem of that
>         particular cluster and not a problem in the application. By
>         re-writing the application to make it more NFS-friendly (f.e.
>         like the above "rank 0 does all I/O"), you will most likely kill
>         scalability for another HPC setup with a distributed/parallel
>         storage setup.
> 
>     Yes, that is true, but may only be critical if the program is I/O
>     intensive (ours are not).
>     One may still fare well with funneling I/O through one or a few
>     processors, if the program is not I/O intensive,
>     and not compromise scalability.
> 
>     The opposite, however, i.e.,
>     writing the program expecting the cluster to
>     provide a parallel file system,
>     is unlikely to scale well on a cluster
>     without one, or not?
> 
> 
>             Often times these codes were developed on big iron machines,
>             ignoring the hurdles one has to face on a Beowulf.
> 
> 
>         Well, the definition of Beowulf is quite fluid. Nowadays is
>         sufficiently easy to get a parallel FS running with commodity
>         hardware that I wouldn't associate it anymore with big iron.
> 
> 
>     That is true, but very budget dependent.
>     If you are on a shoestring budget, and your goal is to do parallel
>     computing, and your applications are not particularly I/O intensive,
>     what would you prioritize: a fast interconnect for MPI,
>     or hardware and software for a parallel file system?
> 
> 
>             In general they don't use MPI parallel I/O either
> 
> 
>         Being on the teaching side in a recent course+practical work
>         involving parallel I/O, I've seen computer science and physics
>         students making quite easily the transition from POSIX I/O done
>         on a shared file system to MPI-I/O. They get sometimes an index
>         wrong, but mostly the conversion is painless. After that, my
>         impression has become that it's mostly lazyness and the attitude
>         'POSIX is everywhere anywhere, why should I bother with
>         something that might be missing' that keeps applications at this
>         stage.
> 
> 
>     I agree with your considerations about laziness and the POSIX-inertia.
>     However, there is still a long way to make programs and programmers
>     at least consider the restrictions imposed by network and file systems,
>     not to mention to use proper parallel I/O.
>     Hopefully courses like yours will improve this.
>     If I could, I would love to go to Heidelberg and take your class myself!
> 
>     Regards,
>     Gus Correa
> 
> 
>     _______________________________________________
>     Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
> 
>