[Beowulf] Parallel Programming Question
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Jun 30 06:54:45 PDT 2009
On Wed, 24 Jun 2009, Gus Correa wrote:
> the "master" processor reads... broadcasts parameters that are used
> by all "slave" processors, and scatters any data that will be
> processed in a distributed fashion by each "slave" processor.
> ...
> That always works, there is no file system contention.
I beg to disagree. There is no file system contention if this job is
the only one doing I/O at that time, which could be the case if a
job takes the whole cluster. However, in a more conventional setup
with several jobs running concurrently, I/O is done simultaneously
from several nodes (each running MPI rank 0 of its own job), and that
will still look like mostly random I/O to the storage.
> Another drawback is that you need to write more code for the I/O
> procedure.
I also disagree here. The I/O code only needs to exist on MPI rank 0,
so there is no need to worry, for the other ranks, about race
conditions, computing a rank-based position in the file, etc.
> In addition, MPI is in control of everything, you are less dependent
> on NFS quirks.
... or cluster design. I have seen several clusters designed with 2
networks, an HPC one (Myrinet or InfiniBand) and GigE, where the HPC
network had full bisection bandwidth but the GigE network was heavily
over-subscribed, because the design considered only MPI performance
and not I/O performance. In such an environment, it's rather useless
to try to do I/O simultaneously from several nodes which share the
same uplink, regardless of whether the storage is a single NFS server
or a parallel FS. Doing I/O from only one node would allow full
utilization of the bandwidth on the chain of uplinks to the
file server, and the data could then be scattered/gathered quickly
over the HPC network. Sure, a more hardware-aware application could
have been more efficient (e.g. if it were possible to describe the
network over-subscription so that as many uplinks as possible could
be used simultaneously), but a more balanced cluster design would
have been even better...
> [ parallel I/O programs ] always cause a problem when the number
> of processors is big.
I'd also like to disagree here. Parallel file systems teach us that a
scalable system is one where the operations are split between several
units that do the work. Applying the same knowledge to the generation
of the data, a scalable application is one for which the I/O
operations are, as much as possible, split between the ranks.
IMHO, the "problem" that you see is actually caused by reaching the
limits of your cluster; IOW this is a local problem of that particular
cluster and not a problem in the application. By rewriting the
application to make it more NFS-friendly (e.g. with the above "rank 0
does all the I/O" approach), you will most likely kill scalability on
another HPC setup with distributed/parallel storage.
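For contrast, a minimal sketch of the rank-distributed pattern that a
parallel FS rewards (output file name and chunk size are again
assumptions): every rank writes its own slice of the result with a
collective MPI-I/O call, so the file system can spread the requests
over its servers.

#include <mpi.h>

void write_result(const double *mine, int chunk)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank's offset follows from its rank -- no locking needed */
    MPI_Offset offset = (MPI_Offset)rank * chunk * sizeof(double);
    MPI_File_write_at_all(fh, offset, mine, chunk, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}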
> Often times these codes were developed on big iron machines,
> ignoring the hurdles one has to face on a Beowulf.
Well, the definition of Beowulf is quite fluid. Nowadays it is
sufficiently easy to get a parallel FS running on commodity hardware
that I wouldn't associate it with big iron anymore.
> In general they don't use MPI parallel I/O either
Having been on the teaching side of a recent course and practical
involving parallel I/O, I've seen computer science and physics
students make the transition from POSIX I/O on a shared file system
to MPI-I/O quite easily. They sometimes get an index wrong, but mostly
the conversion is painless. My impression since then is that it's
mostly laziness and the attitude 'POSIX is everywhere anyway, why
should I bother with something that might be missing' that keeps
applications at this stage.
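As a rough illustration of the starting point (names and sizes are
hypothetical): plain POSIX I/O on a shared file system, with each rank
computing its own byte offset. The MPI-I/O version it gets converted
into is essentially the MPI_File_write_at_all call sketched above.

#include <fcntl.h>
#include <unistd.h>

void write_posix(const double *mine, size_t chunk, int rank)
{
    int fd = open("output.dat", O_CREAT | O_WRONLY, 0644);
    /* the rank-based offset is where an off-by-one easily slips in */
    pwrite(fd, mine, chunk * sizeof(double),
           (off_t)rank * (off_t)(chunk * sizeof(double)));
    close(fd);
}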
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de