[Beowulf] Parallel Programming Question

Gus Correa gus at ldeo.columbia.edu
Wed Jul 1 14:43:04 PDT 2009


Hi Bogdan, list

Bogdan Costescu wrote:
> On Tue, 30 Jun 2009, Gus Correa wrote:
> 
>> My answers were given in the context of Amjad's original questions
> 
> Sorry, I somehow missed the context for the questions. Still, the
> thoughts about I/O programming are general in nature, so they would
> apply in any case.
> 
>> Hence, he may want to follow the path of least resistance, rather than 
>> aim at the fanciest programming paradigm.
> 
> Heh, I have the impression that most scientific software is started like
> that, and only if it's interesting enough (e.g. it survives the first
> generation of PhD students working on/with it) and gets developed
> further does it have some "fanciness" added to it. ;-)
> 

I can only say something about the codes I know.
A bunch of atmosphere, ocean, and climate models have been quite
resilient.  Not static: they have evolved, both on the science and on
the programming side, but they have kept some basic characteristics,
mostly the central algorithms they use.

In some cases the addition of programming "fanciness"
was a leap forward:
e.g. encapsulating MPI communication in modules and libraries
(a minimal sketch of such a wrapper follows below).
In other cases, not so much:
e.g. transitioning from F77 to F90 only to add derived types for
everything (and types of types of types ...),
and 10 levels of operator overloading to do trivial things,
which did little but slow down some codes and make them harder
to adapt and maintain.
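
Just to illustrate the kind of encapsulation I mean, here is a minimal
sketch (in C rather than Fortran, with a hypothetical routine name
"halo_exchange"): the raw MPI calls are hidden behind one small library
routine, so the physics code never touches MPI directly.

/* halo.c -- hypothetical helper that hides the MPI details of a 1-D
   boundary exchange behind a single call.  Sketch only. */
#include <mpi.h>

/* Exchange one ghost cell with the left and right neighbors.
   "field" holds n interior points plus one ghost cell at each end. */
void halo_exchange(double *field, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* send first interior point left, receive the right ghost cell */
    MPI_Sendrecv(&field[1],     1, MPI_DOUBLE, left,  0,
                 &field[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send last interior point right, receive the left ghost cell */
    MPI_Sendrecv(&field[n],     1, MPI_DOUBLE, right, 1,
                 &field[0],     1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

The physics code just calls halo_exchange() and never sees a tag or a
communicator, which is where most of the maintainability gain comes from.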

>> Nevertheless, what if these 1000 jobs are running on the same cluster, 
>> but doing "brute force" I/O through each of their, say, 100 processes? 
>> Wouldn't file and network contention be larger than if the jobs were 
>> funneling I/O through a single processor?
> 
> The network connection to the NFS file server or some uplinks in an 
> over-subscribed network would impose the limitation - this would be a 
> hard limitation: it doesn't matter if you divide a 10G link between 100 
> or 100000 down-links, it will not exceed 10G in any case; in extreme 
> cases, the switch might not take the load and start dropping packets. 
> Similarly for an NFS file server: it certainly makes a difference whether
> it needs to serve 1 client or 100 simultaneously, but beyond that point it
> won't matter too much how many there are (well, that was my experience 
> anyway, I'd be interested to hear about a different experience).
> 

We are on the low end of small clusters here.
Our cluster networks are small, single-switch setups so far.

>> Absolutely, but the emphasis I've seen, at least for small clusters
>> designed for scientific computations in a small department or research
>> group, is to pay less attention to I/O, at least in the cases I had
>> the chance to know about. By the time one gets to the design of the
>> file systems and I/O, the budget is already completely used up buying
>> a fast interconnect for MPI.
> 
> That's a mistake that I have also made. But one can learn from one's own
> mistakes or from the mistakes of others. I'm now trying to
> help others understand that the cluster is not only about CPU or MPI
> performance, but about the whole, including storage. So, spread the word
> :-)
> 

I advised on the purchase of two clusters here, one of which I administer.
In both cases the recommendation to buy equipment to support a parallel
file system was dropped because of budget constraints.
This may explain my bias toward the
"funnel all I/O through the master processor" paradigm.
I do recognize the need for parallel file systems, and for the appropriate
use of MPI-I/O (and libraries built on it, like parallel HDF5 and parallel
NetCDF) to exploit that capability and avoid I/O bottlenecks.
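
For the record, here is a minimal sketch of what that MPI-I/O usage
looks like in C (the file name and the assumption that each rank owns a
contiguous block of local_n doubles are mine): every rank writes its own
slice of a single shared file, instead of funneling through the master.

/* mpiio_write.c -- minimal MPI-I/O sketch: each rank writes its own
   contiguous slice of one shared file.  File name is hypothetical. */
#include <mpi.h>

void write_slice(double *local, int local_n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank writes at an offset proportional to its rank */
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}

Of course, whether this actually helps depends on the file system
underneath it, which is the whole point of this discussion.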

>>> >  [ parallel I/O programs ] always cause a problem when the number
>>> >  of processors is big.
>> Sorry, but I didn't say parallel I/O programs.
> 
> No, that was me trying to condense your description in a few words to 
> allow for more clipping - I have obviously failed...
> 
>> The opposite, however, i.e., writing the program expecting the cluster 
>> to provide a parallel file system, is unlikely to scale well on a 
>> cluster without one, or not?
> 
> I interpret your words (maybe again mistakenly) as a general remark, and
> I can certainly find cases where the statement is false. If you have a
> well thought-out network design and an NFS file server that can take the
> load, good scaling could still be achieved - please note, however, that
> I'm not necessarily referring to a Linux-based NFS file server; an
> "appliance" (e.g. from NetApp or Isilon) could take that role as well,
> although at a price.

I am sure you can find counterexamples.

However, for the mainstream bare-bones NFS/Ethernet file server that
many small clusters use, the safest (or least risky)
thing to do is to funnel I/O through the master node
in a non-I/O-intensive parallel program,
or to use local disks if the I/O is heavy.

It is a poor man's approach, admittedly, but an OK one,
as it tries to dodge the system's bottlenecks as much as it can.
However, it doesn't offer a way to remove those bottlenecks,
which is what parallel file systems and MPI-I/O are meant to do,
I presume.
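
For what it's worth, a minimal sketch of that funneling pattern in C
(the buffer layout and the output file name are hypothetical): rank 0
gathers everyone's slice and is the only process that touches the
NFS-mounted disk.

/* funnel_io.c -- sketch of funneling output through the master: the
   other ranks only call MPI_Gather; rank 0 alone writes to the disk. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void write_through_master(double *local, int local_n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)size * local_n * sizeof(double));

    /* collect every rank's slice on the master */
    MPI_Gather(local, local_n, MPI_DOUBLE,
               global, local_n, MPI_DOUBLE, 0, comm);

    if (rank == 0) {               /* only rank 0 does file I/O */
        FILE *fp = fopen("snapshot.dat", "wb");
        if (fp) {
            fwrite(global, sizeof(double), (size_t)size * local_n, fp);
            fclose(fp);
        }
        free(global);
    }
}

The price is obvious: the master needs enough memory to hold the whole
field, and everybody else waits while it writes.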

> 
>> If you are on a shoestring budget, and your goal is to do parallel 
>> computing, and your applications are not particularly I/O intensive, 
>> what would you prioritize: a fast interconnect for MPI, or hardware 
>> and software for a parallel file system?
> 
> A balanced approach :-) It's important to have a clear idea of what "not 
> particularly I/O intensive" actually means and how much value the users 
> give for the various tasks that would run on the cluster.
> 

In coarse terms, I would guess that a program is not particularly I/O 
intensive if the computation and (non-I/O) MPI communication
effort is much larger than the I/O effort.

Typically we do I/O every 4000 time steps or so, in rare cases
every ~40 time steps.
In between there is heavy computation on every time step,
plus an MPI exchange of boundary values (a series of 2D arrays)
on every time step.
If the computation halts once every 4000 steps to funnel I/O through
the master processor, that is not a big deal in the overall effort.
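
To make that ratio concrete, here is a sketch of such a time loop in C,
using the hypothetical routines sketched earlier in this message
(compute_step, halo_exchange, write_through_master): MPI exchange on
every step, file I/O only once every 4000 steps.

/* timeloop.c -- sketch of the work/communication/I-O ratio described
   above, built on the hypothetical helpers sketched earlier. */
#include <mpi.h>

void compute_step(double *field, int n);          /* physics, every step */
void halo_exchange(double *field, int n, MPI_Comm comm);
void write_through_master(double *local, int local_n, MPI_Comm comm);

void run(double *field, int n, int nsteps, MPI_Comm comm)
{
    for (int step = 1; step <= nsteps; step++) {
        compute_step(field, n);          /* heavy computation          */
        halo_exchange(field, n, comm);   /* boundary exchange, cheap   */

        if (step % 4000 == 0)            /* rare: funnel I/O to rank 0 */
            write_through_master(field + 1, n, comm);  /* interior only */
    }
}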

Other parallel applications (scientific or not) may have a very 
different ratio, I suppose.

>> Hopefully courses like yours will improve this. If I could, I would 
>> love to go to Heidelberg and take your class myself!
> 
> Just to make this clearer: I wasn't the one teaching; according to Uni 
> Heidelberg regulations, I'd need to hold a degree (like a Dr.) to be 
> allowed to teach. 

Too bad that Heidelberg is so scholastic with respect to academic titles.
They should let you teach.
I always read your postings and learn from them.
Your students should subscribe to this list, then!  :)

> But no such restrictions are present for the 
> practical work, which is actually the part that I find most interesting 
> because it's more "meaty" and a dialogue with students can take place ;-)
> 

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


