[Beowulf] Parallel Programming Question
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Jun 30 06:54:45 PDT 2009
On Wed, 24 Jun 2009, Gus Correa wrote:
> the "master" processor reads... broadcasts parameters that are used
> by all "slave" processors, and scatters any data that will be
> processed in a distributed fashion by each "slave" processor.
> ...
> That always works, there is no file system contention.
I beg to disagree. There is no file system contention if this job is
the only one doing I/O at that time, which could be the case if a
job takes the whole cluster. However, in a more conventional setup
with several jobs running concurrently, I/O is done simultaneously
from several nodes (each running MPI rank 0 of its own job), and that
will still look like mostly random I/O to the storage.
> Another drawback is that you need to write more code for the I/O
> procedure.
I also disagree here. The I/O code only needs to exist on MPI rank 0,
so there is no need to worry, for the other ranks, about race
conditions, computing a rank-based position in the file, etc.
> In addition, MPI is in control of everything, you are less dependent
> on NFS quirks.
... or cluster design. I have seen several clusters designed with 2
networks, an HPC one (Myrinet or InfiniBand) and GigE, where the HPC
network had full bisection bandwidth but the GigE network was heavily
over-subscribed, because the design considered only MPI performance
and not I/O performance. In such an environment, it's rather useless
to try to do I/O simultaneously from several nodes which share the
same uplink, regardless of whether the storage is a single NFS server
or a parallel FS. Doing I/O from only one node would allow full
utilization of the bandwidth on the chain of uplinks to the
file server, and the data could then be scattered/gathered quickly
over the HPC network. Sure, a more hardware-aware application could
have been more efficient (e.g. if it were possible to describe the
network over-subscription so that as many uplinks as possible could
be used simultaneously), but a more balanced cluster design would
have been even better...
> [ parallel I/O programs ] always cause a problem when the number
> of processors is big.
I'd also like to disagree here. Parallel file systems teach us that a
scalable system is one where the operations are split between several
units that do the work. Applying the same knowledge to the generation
of the data, a scalable application is one for which the I/O
operations are, as much as possible, split between the ranks.
IMHO, the "problem" that you see is actually caused by reaching the
limits of your cluster; IOW this is a local problem of that particular
cluster and not a problem in the application. By rewriting the
application to make it more NFS-friendly (e.g. with the above "rank 0
does all the I/O" approach), you will most likely kill scalability on
another HPC setup with distributed/parallel storage.
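For contrast, a minimal sketch of the rank-distributed pattern that a
parallel FS rewards (output file name and chunk size are again
assumptions): every rank writes its own slice of the result with a
collective MPI-I/O call, so the file system can spread the requests
over its servers.

#include <mpi.h>

void write_result(const double *mine, int chunk)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank's offset follows from its rank -- no locking needed */
    MPI_Offset offset = (MPI_Offset)rank * chunk * sizeof(double);
    MPI_File_write_at_all(fh, offset, mine, chunk, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}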
> Often times these codes were developed on big iron machines,
> ignoring the hurdles one has to face on a Beowulf.
Well, the definition of Beowulf is quite fluid. Nowadays it is
sufficiently easy to get a parallel FS running on commodity hardware
that I wouldn't associate it with big iron anymore.
> In general they don't use MPI parallel I/O either
Having been on the teaching side of a recent course and practical
involving parallel I/O, I've seen computer science and physics
students make the transition from POSIX I/O on a shared file system
to MPI-I/O quite easily. They sometimes get an index wrong, but mostly
the conversion is painless. My impression since then is that it's
mostly laziness and the attitude 'POSIX is everywhere anyway, why
should I bother with something that might be missing' that keeps
applications at this stage.
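As a rough illustration of the starting point (names and sizes are
hypothetical): plain POSIX I/O on a shared file system, with each rank
computing its own byte offset. The MPI-I/O version it gets converted
into is essentially the MPI_File_write_at_all call sketched above.

#include <fcntl.h>
#include <unistd.h>

void write_posix(const double *mine, size_t chunk, int rank)
{
    int fd = open("output.dat", O_CREAT | O_WRONLY, 0644);
    /* the rank-based offset is where an off-by-one easily slips in */
    pwrite(fd, mine, chunk * sizeof(double),
           (off_t)rank * (off_t)(chunk * sizeof(double)));
    close(fd);
}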
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de