[Beowulf] Parallel Programming Question

Tue Jun 30 21:55:51 PDT 2009

Hi,
Gus--thank you.
You are right. I mainly have to run programs on a small cluster (GiGE)
dedicated for my job only; and sometimes I might get some opportunity to run
my code on a shared cluster with hundreds of nodes.

My parallel CFD application involves (In its simplest form):
1) reading of input and mesh data from few files by the master process (I/O)
2) Sending the full/respective data to all other processes (MPI
Broadcast/Scatter)
3) Share the values at subdomains (created by Metis) boundaries at the end
of each iteration (MPI Message Passing)
4) On converge,  send/Gather the results from all processes to master
process
5) Writing the results to files by the master process (I/O).

So I think my program is not I/O intensive; so the Funneling I/O through the
master process is sufficient for me. Right?

But now I have to parallelize a new serial code, which plots the results
while running  (online/live display).  Means that it shows the  plots of
three/four variables (in small windows) while running and we see it as video
(all progress from initial stage to final stage). I assume that this time
much more I/O is involved. At the end of each iteration result needs to be
gathered from all processes to the master process. And possibly needs to be
written in files as well (I am not sure). Do we need to write it on some
file/s for online display, after each iteration/time-step?

I think (as serial code will be displaying result after each iteration/time
step), I should display result online after 100 iterations/time-steps in my
parallel version so less "I/O" and/or "funneling I/O through master process"
will be required.
Any opinion/suggestion?

regards,
Amjad Ali.

On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Bogdan, list
>
> Oh, well, this is definitely a peer reviewed list.
> My answers were given in the context of Amjad's original
> questions, and the perception, based on Amjad's previous
> and current postings, that he is not dealing with a large cluster,
> or with many users, and plans to both parallelize and update his code from
> F77 to F90, which can be quite an undertaking.
> Hence, he may want to follow the path of least resistance,
> rather than aim at the fanciest programming paradigm.
>
> In the edited answer that context was stripped off,
> and so was the description of
> "brute force" I/O in parallel jobs.
> That was the form of concurrent I/O I was referring to.
> An I/O mode which doesn't take any precautions
> to avoid file and network contention, and unfortunately is more common
> than clean, well designed, parallel I/O code (at least on the field I
> work).
>
> That was the form of concurrent I/O I was referring to
> (all processors try to do I/O at the same time using standard
> open/read/write/close commands provided by Fortran or another language,
> not MPI calls).
>
> Bogdan seems to be talking about programs with well designed parallel I/O
> instead.
>
> Bogdan Costescu wrote:
>
>> On Wed, 24 Jun 2009, Gus Correa wrote:
>>
>>  the "master" processor reads... broadcasts parameters that are used by
>>> all "slave" processors, and scatters any data that will be processed in a
>>> distributed fashion by each "slave" processor.
>>> ...
>>> That always works, there is no file system contention.
>>>
>>
>> I beg to disagree. There is no file system contention if this job is the
>> only one doing the I/O at that time, which could be the case if a job takes
>> the whole cluster. However, in a more conventional setup with several jobs
>> running at the same time, there is I/O done from several nodes (running the
>> MPI rank 0 of each job) at the same time, which will still look like mostly
>> random I/O to the storage.
>>
>>
> Indeed, if there are 1000 jobs running,
> even if each one is funneling I/O through
> the "master" processor, there will be a large number of competing
> requests to the I/O system, hence contention.
> However, contention would also happen if all jobs were serial.
> Hence, this is not a problem caused by or specific from parallel jobs.
> It is an intrinsic limitation of the I/O system.
>
> Nevertheless, what if these 1000 jobs are running on the same cluster,
> but doing "brute force" I/O through
> each of their, say, 100 processes?
> Wouldn't file and network contention be larger than if the jobs were
> funneling I/O through a single processor?
>
> That is the context in which I made my statement.
> Funneling I/O through a "master" processor reduces the chances of file
> contention because it minimizes the number of processes doing I/O,
> or not?
>
>  Another drawback is that you need to write more code for the I/O
>>> procedure.
>>>
>>
>> I also disagree here. The code doing I/O would need to only happen on MPI
>> rank 0, so no need to think for the other ranks about race conditions,
>> computing a rank-based position in the file, etc.
>>
>>
> From what you wrote,
> you seem to agree with me on this point, not disagree.
>
> 1) Brute force I/O through all ranks takes little programming effort,
> the code is basically the same serial,
> and tends to trigger file contention (and often breaks NFS, etc).
> 2) Funneling I/O through the master node takes a moderate programming
> effort.  One needs to gather/scatter data through the "master" processor,
> which concentrates the I/O, and reduces contention.
> 3) Correct and cautious parallel I/O across all ranks takes a larger
> programming effort,
> due to the considerations you pointed out above.
>
>  In addition, MPI is in control of everything, you are less dependent on
>>> NFS quirks.
>>>
>>
>> ... or cluster design. I have seen several clusters which were designed
>> with 2 networks, a HPC one (Myrinet or Infiniband) and GigE, where the HPC
>> network had full bisection bandwidth, but the GigE was a heavily
>> over-subscribed one as the design really thought only about MPI performance
>> and not about I/O performance. In such an environment, it's rather useless
>> to try to do I/O simultaneously from several nodes which share the same
>> uplink, independent whether the storage is a single NFS server or a parallel
>> FS. Doing I/O from only one node would allow full utilization of the
>> bandwidth on the chain of uplinks to the file-server and the data could then
>> be scattered/gathered fast through the HPC network. Sure, a more
>> hardware-aware application could have been more efficient (f.e. if it would
>> be possible to describe the network over-subscription so that as many
>> uplinks could be used simultaneously as possible), but a more balanced
>> cluster design would have been even better...
>>
>>
> Absolutely, but the emphasis I've seen, at least for small clusters
> designed for scientific computations in a small department or research
> group is to pay less attention to I/O that I had the chance to know about.
> When one gets to the design of the filesystems and I/O the budget is
> already completely used up to buy a fast interconnect for MPI.
> I/O is then done over Gigabit Ethernet using a single NFS
> file server (often times a RAID on the head node itself).
> For the scale of a small cluster, with a few tens of nodes or so,
> this may work OK,
> as long as one writes code that is gentle with NFS
> (e.g. by funneling I/O through the head node).
>
> Obviously the large clusters on our national labs and computer centers
> do take into consideration I/O requirements, parallel file systems, etc.
> However, that is not my reality here, and I would guess it is
> not Amjad's situation either.
>
>  [ parallel I/O programs ] always cause a problem when the number of
>>> processors is big.
>>>
>>
>>
> Sorry, but I didn't say parallel I/O programs.
> I said brute force I/O by all processors (using standard NFS,
> no parallel file system, all processors banging on the file system
> with no coordination).
>
>  I'd also like to disagree here. Parallel file systems teach us that a
>> scalable system is one where the operations are split between several units
>> that do the work. Applying the same knowledge to the generation of the data,
>> a scalable application is one for which the I/O operations are done as much
>> as possible split between the ranks.
>>
>
> Yes.
> If you have a parallel file system.
>
>
>> IMHO, the "problem" that you see is actually caused by reaching the limits
>> of your cluster, IOW this is a local problem of that particular cluster and
>> not a problem in the application. By re-writing the application to make it
>> more NFS-friendly (f.e. like the above "rank 0 does all I/O"), you will most
>> likely kill scalability for another HPC setup with a distributed/parallel
>> storage setup.
>>
>>  Yes, that is true, but may only be critical if the program is I/O
> intensive (ours are not).
> One may still fare well with funneling I/O through one or a few
> processors, if the program is not I/O intensive,
> and not compromise scalability.
>
> The opposite, however, i.e.,
> writing the program expecting the cluster to
> provide a parallel file system,
> is unlikely to scale well on a cluster
> without one, or not?
>
>  Often times these codes were developed on big iron machines, ignoring the
>>> hurdles one has to face on a Beowulf.
>>>
>>
>> Well, the definition of Beowulf is quite fluid. Nowadays is sufficiently
>> easy to get a parallel FS running with commodity hardware that I wouldn't
>> associate it anymore with big iron.
>>
>>
> That is true, but very budget dependent.
> If you are on a shoestring budget, and your goal is to do parallel
> computing, and your applications are not particularly I/O intensive,
> what would you prioritize: a fast interconnect for MPI,
> or hardware and software for a parallel file system?
>
>  In general they don't use MPI parallel I/O either
>>>
>>
>> Being on the teaching side in a recent course+practical work involving
>> parallel I/O, I've seen computer science and physics students making quite
>> easily the transition from POSIX I/O done on a shared file system to
>> MPI-I/O. They get sometimes an index wrong, but mostly the conversion is
>> painless. After that, my impression has become that it's mostly lazyness and
>> the attitude 'POSIX is everywhere anywhere, why should I bother with
>> something that might be missing' that keeps applications at this stage.
>>
>>
> I agree with your considerations about laziness and the POSIX-inertia.
> However, there is still a long way to make programs and programmers
> at least consider the restrictions imposed by network and file systems,
> not to mention to use proper parallel I/O.
> Hopefully courses like yours will improve this.
> If I could, I would love to go to Heidelberg and take your class myself!
>
> Regards,
> Gus Correa
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20090701/fbc2284c/attachment.html>