[Beowulf] Parallel Programming Question

Tue Jun 30 17:06:25 PDT 2009

Hi Bogdan, list

Oh, well, this is definitely a peer-reviewed list.

My answers were given in the context of Amjad's original
questions, and the perception, based on Amjad's previous
and current postings, that he is not dealing with a large cluster,
or with many users, and plans to both parallelize and update his code 
from F77 to F90, which can be quite an undertaking.
Hence, he may want to follow the path of least resistance,
rather than aim at the fanciest programming paradigm.

In the edited answer that context was stripped off,
and so was the description of
"brute force" I/O in parallel jobs.
That was the form of concurrent I/O I was referring to:
all processors try to do I/O at the same time using standard
open/read/write/close commands provided by Fortran or another language,
not MPI calls.
It is an I/O mode which doesn't take any precautions
to avoid file and network contention, and unfortunately it is more common
than clean, well designed, parallel I/O code (at least in the field I
work in).

Bogdan seems to be talking about programs with well designed parallel 
I/O instead.

Bogdan Costescu wrote:
> On Wed, 24 Jun 2009, Gus Correa wrote:
> 
>> the "master" processor reads... broadcasts parameters that are used by 
>> all "slave" processors, and scatters any data that will be processed 
>> in a distributed fashion by each "slave" processor.
>> ...
>> That always works, there is no file system contention.
> 
> I beg to disagree. There is no file system contention if this job is the 
> only one doing the I/O at that time, which could be the case if a job 
> takes the whole cluster. However, in a more conventional setup with 
> several jobs running at the same time, there is I/O done from several 
> nodes (running the MPI rank 0 of each job) at the same time, which will 
> still look like mostly random I/O to the storage.
> 

Indeed, if there are 1000 jobs running,
even if each one is funneling I/O through
the "master" processor, there will be a large number of competing
requests to the I/O system, hence contention.
However, contention would also happen if all jobs were serial.
Hence, this is not a problem caused by or specific to parallel jobs.
It is an intrinsic limitation of the I/O system.

Nevertheless, what if these 1000 jobs are running on the same cluster,
but doing "brute force" I/O through
each of their, say, 100 processes?
Wouldn't file and network contention be larger than if the jobs were
funneling I/O through a single processor?

That is the context in which I made my statement.
Funneling I/O through a "master" processor reduces the chances of file
contention because it minimizes the number of processes doing I/O,
or not?
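
To put rough numbers on the question above, here is a back-of-the-envelope
sketch in Python. The job and process counts are the ones quoted in this
message; the function name io_clients is made up for illustration, and the
result is only an upper bound on how many processes could hit the file
server at once, not a measurement of actual contention.

```python
# Hypothetical workload using the figures from this message:
# 1000 jobs sharing one cluster, 100 MPI processes per job.

def io_clients(jobs: int, procs_per_job: int, funneled: bool) -> int:
    """Upper bound on the processes that may do I/O at the same time."""
    # Funneled I/O: only the "master" process of each job touches the files.
    # Brute force I/O: every process of every job does.
    writers_per_job = 1 if funneled else procs_per_job
    return jobs * writers_per_job

brute_force = io_clients(1000, 100, funneled=False)
funneled = io_clients(1000, 100, funneled=True)

# Prints: 100000 1000 100
print(brute_force, funneled, brute_force // funneled)
```

Under these assumptions the file server sees at most 1000 competing
clients instead of 100000, which is the reduction I had in mind.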

>> Another drawback is that you need to write more code for the I/O 
>> procedure.
> 
> I also disagree here. The code doing I/O would need to only happen on 
> MPI rank 0, so no need to think for the other ranks about race 
> conditions, computing a rank-based position in the file, etc.
> 

From what you wrote,
you seem to agree with me on this point, not disagree.

1) Brute force I/O through all ranks takes little programming effort,
the code is basically the same as the serial one,
and tends to trigger file contention (and often breaks NFS, etc).
2) Funneling I/O through the master node takes a moderate programming
effort.  One needs to gather/scatter data through the "master" 
processor, which concentrates the I/O, and reduces contention.
3) Correct and cautious parallel I/O across all ranks takes a larger 
programming effort,
due to the considerations you pointed out above.
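
As an illustration of the extra bookkeeping in item 3, here is a minimal
sketch, in Python rather than Fortran for brevity, of the "rank-based
position in the file" computation. It assumes fixed-size records split
into contiguous blocks, one block per rank. The function name block_range
and the record size are hypothetical, and the sketch does no I/O or MPI
calls at all, it only computes where each rank would read or write.

```python
def block_range(rank: int, nprocs: int, nrecords: int, record_bytes: int):
    """Return (first_record, count, byte_offset) for one rank.

    Records are split into contiguous blocks. The first
    nrecords % nprocs ranks get one extra record, so every record
    is assigned to exactly one rank.
    """
    base, extra = divmod(nrecords, nprocs)
    count = base + (1 if rank < extra else 0)
    first = rank * base + min(rank, extra)
    return first, count, first * record_bytes

# Example: 10 records of 8 bytes each, shared by 4 ranks.
for rank in range(4):
    print(rank, block_range(rank, nprocs=4, nrecords=10, record_bytes=8))
```

For this example the ranks get (0, 3, 0), (3, 3, 24), (6, 2, 48) and
(8, 2, 64). An index mistake here makes ranks overwrite each other's
data or leave holes in the file, which is part of why this approach
costs more programming effort than the other two.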

>> In addition, MPI is in control of everything, you are less dependent 
>> on NFS quirks.
> 
> ... or cluster design. I have seen several clusters which were designed 
> with 2 networks, a HPC one (Myrinet or Infiniband) and GigE, where the 
> HPC network had full bisection bandwidth, but the GigE was a heavily 
> over-subscribed one as the design really thought only about MPI 
> performance and not about I/O performance. In such an environment, it's 
> rather useless to try to do I/O simultaneously from several nodes which 
> share the same uplink, independent whether the storage is a single NFS 
> server or a parallel FS. Doing I/O from only one node would allow full 
> utilization of the bandwidth on the chain of uplinks to the file-server 
> and the data could then be scattered/gathered fast through the HPC 
> network. Sure, a more hardware-aware application could have been more 
> efficient (f.e. if it would be possible to describe the network 
> over-subscription so that as many uplinks could be used simultaneously 
> as possible), but a more balanced cluster design would have been even 
> better...
> 

Absolutely, but on the small clusters I had the chance to know about,
designed for scientific computations in a small department or research
group, the emphasis is to pay less attention to I/O.
By the time one gets to the design of the file systems and I/O, the budget
has already been used up to buy a fast interconnect for MPI.
I/O is then done over Gigabit Ethernet using a single NFS
file server (often a RAID on the head node itself).
For the scale of a small cluster, with a few tens of nodes or so,
this may work OK,
as long as one writes code that is gentle with NFS
(e.g. by funneling I/O through the head node).

Obviously the large clusters at our national labs and computer centers
do take into consideration I/O requirements, parallel file systems, etc. 
However, that is not my reality here, and I would guess it is
not Amjad's situation either.

>> [ parallel I/O programs ] always cause a problem when the number of 
>> processors is big.
> 

Sorry, but I didn't say parallel I/O programs.
I said brute force I/O by all processors (using standard NFS,
no parallel file system, all processors banging on the file system
with no coordination).

> I'd also like to disagree here. Parallel file systems teach us that a 
> scalable system is one where the operations are split between several 
> units that do the work. Applying the same knowledge to the generation of 
> the data, a scalable application is one for which the I/O operations are 
> done as much as possible split between the ranks.

Yes.
If you have a parallel file system.

> 
> IMHO, the "problem" that you see is actually caused by reaching the 
> limits of your cluster, IOW this is a local problem of that particular 
> cluster and not a problem in the application. By re-writing the 
> application to make it more NFS-friendly (f.e. like the above "rank 0 
> does all I/O"), you will most likely kill scalability for another HPC 
> setup with a distributed/parallel storage setup.
> 
Yes, that is true, but may only be critical if the program is I/O
intensive (ours are not).
One may still fare well with funneling I/O through one or a few
processors, if the program is not I/O intensive,
and not compromise scalability.

The opposite, however, i.e.,
writing the program expecting the cluster to
provide a parallel file system,
is unlikely to scale well on a cluster
without one, or not?

>> Often times these codes were developed on big iron machines, ignoring 
>> the hurdles one has to face on a Beowulf.
> 
> Well, the definition of Beowulf is quite fluid. Nowadays is sufficiently 
> easy to get a parallel FS running with commodity hardware that I 
> wouldn't associate it anymore with big iron.
> 

That is true, but very budget dependent.
If you are on a shoestring budget, and your goal is to do parallel
computing, and your applications are not particularly I/O intensive,
what would you prioritize: a fast interconnect for MPI,
or hardware and software for a parallel file system?

>> In general they don't use MPI parallel I/O either
> 
> Being on the teaching side in a recent course+practical work involving 
> parallel I/O, I've seen computer science and physics students making 
> quite easily the transition from POSIX I/O done on a shared file system 
> to MPI-I/O. They get sometimes an index wrong, but mostly the conversion 
> is painless. After that, my impression has become that it's mostly 
> laziness and the attitude 'POSIX is everywhere anywhere, why should I 
> bother with something that might be missing' that keeps applications at 
> this stage.
> 

I agree with your considerations about laziness and the POSIX-inertia.
However, there is still a long way to go before programs and programmers
at least consider the restrictions imposed by network and file systems,
not to mention use proper parallel I/O.
Hopefully courses like yours will improve this.
If I could, I would love to go to Heidelberg and take your class myself!

Regards,
Gus Correa