C++ programming (was Newbie Alert: Beginning parallel programming with Scyld)

Robert G. Brown rgb at phy.duke.edu
Fri Oct 18 12:29:04 PDT 2002

On Thu, 17 Oct 2002, Gerry Creager N5JXS wrote:

> Strictly speaking, an accomplished Fortran programmer (OK.  I see you 
> out there.  Stop giggling!) goes through 3 phases of accomplishment when 
> learning about parallelization.
> 1.  Too dumb to follow convention.  Loops are almost always simply 
> unrolled and parallelizable.
> 2.  Learned to follow the herd.  Loops are consistent with convention. 
> Must decompose the entire program to make the one parallelizable loop in 
> the middle a little more efficient.  Rest of code should have stayed 
> linear but now suffers processing delays when CPU laughs at program 
> structure.
> 3.  Learned what to do to parallelize code.  Segregates parallel code 
> from serial code.  Processes each appropriately.  Trusts no compiler. 
> Looks at assembly output for flaws.  Lives on Twinkies, Jolt and Cheetos 
> (crispy, pepper-hot).

<rgb type="rant" category="ignorable" on_topic_index="somewhat">

And just what does any of this have to do with Fortran?  Especially
number 3?  Is it that Fortran programmers take two steps to get to step
3, while C programmers already live on TJC, trust no compiler, and
recognize that they'd damn well better learn what to do to parallelize
code because no silly-ass compiler is gonna be able to do it for them?  Heck,
C compilers won't even >>serialize<< code for you...I put instructions
out of order all the time;-)

A point to raise before we pass into a state of outright war (not with
you specifically Gerry, just with the discussion:-) is that there are
Libraries, and there is the Compiler.  The compiler is this thing that
turns higher level logic/action constructs into machine code, maybe
taking a detour through Mr.  Assembler on the way.  Libraries are these
collections of reusable code, accessible at a higher level via an API.
Code that compiles in a lovely way will not run unless linked to these
libraries.

This discussion has almost from the beginning confused the two.

C is arguably the thinnest of skins between the programmer and the
machine language, the thinnest of interfaces to the operating system and
peripheral devices.  For that reason it is generally preferred, I
believe, for doing things like writing operating systems and device
drivers, where you don't WANT a compiler doing something like
rearranging your code at some higher level.  It is also one of the most
powerful languages -- one of the major complaints against C is that it
provides one with so LITTLE insulation against all that raw power.  Wanna
write all over your own dataspace and randomly destroy your program?
With C you can.  Other languages might stop you, which is great a lot of
the time but then stops you from doing it when it might really be
clever, deliberate, and useful.  A C programmer has to be the most
disciplined of programmers because with unholy power comes unholy
responsibility or you'll spend an unholy amount of time dealing with
memory leaks, overwriting your own variables, and sloppy evil.  But it
can be oh, so efficient!

C++, Fortran, Pascal, Basic all add or modify things to this basic skin.
A lot of what they modify is syntactical trivia -- braces vs end
statements to indicate logical code blocks, = vs := for assignment, ==
vs .eq. for logical equality.  This sort of "difference" is irrelevant
-- a good perl script could translate from one syntax to the other and
in fact some good perl scripts do.

However, the issue of "fortran can parallelize better than C" (or can
parallelize at all) goes beyond differences in the language syntax.  The
issue there is whether parallelization is better done (or is done at
all) at the level of the compiler (translator to machine language
statements) or with libraries.  Is it, should it be, intrinsic to the
language constructs themselves or a deliberate choice engineered into
the code.

There has been debate about this over the ages, but my own opinion is
that none of the existing "popular" languages are designed in a way to
facilitate parallelism at the compiler level or (for that matter)
vectorization, with the possible exception of APL, which actually had a
hellacious way with arrays, where formulae like x = Ay with x and y
vectors and A an array were coded a lot like x <- Ay (where allowance should
be made by any APL experts out there for the fact that I haven't
actually used it in about twenty years;-).  C, F-whatever, C++ -- all of
them would either do this with explicit loops in the code itself or with
library calls, where the library would HIDE from the programmer the
loops but they would be there nonetheless, likely written in code that
was itself compiled from C or F or whatever source.  In APL those loops
are STILL there, but completely and inextricably hidden from the user in
the compiler itself.
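
For concreteness, here is roughly what the explicit-loop version of
x = Ay looks like in C.  This is only a minimal sketch: the function name
matvec and the row-major storage of A are assumptions made for
illustration, not taken from any particular library.

```c
#include <stddef.h>

/* Compute x = A*y for a dense row-major matrix A with nrows rows and
 * ncols columns.  This is the loop nest that an array language or a
 * library call would otherwise hide from the programmer. */
void matvec(size_t nrows, size_t ncols,
            const double *A, const double *y, double *x)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < ncols; j++)
            sum += A[i * ncols + j] * y[j];
        x[i] = sum;
    }
}
```

Whether these loops live in your own source, in a library, or inside the
compiler, something very much like them ends up being executed.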

This may sound like a silly distinction, but it is not.  Before thinking
of the really complicated parallel case, consider the much simpler case
of (single threaded) linear algebra.  If you like, there are many BLAS.
There are good BLAS, bad BLAS, ATLAS BLAS.  If you don't like your BLAS,
you can change it, and provided you program via a BLAS-based API, you
don't even have to change your code, ditto of course for higher order
linear algebra or other libraries.  Consider how a regular compiler
could deal with parallel BLAS.  Consider how one could link a regular
BLAS with code compiled with a "parallel compiler".
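
The "change your BLAS without changing your code" idea can be sketched
in plain C with a function pointer standing in for the library
interface.  Everything named here (matvec_fn, matvec_reference,
matvec_unrolled, apply_twice) is hypothetical and is not the real BLAS
API; with a real BLAS the swap would normally happen at link time rather
than through a pointer.

```c
#include <stddef.h>

/* Hypothetical interface: any routine computing x = A*y for a square
 * row-major n-by-n matrix.  Stands in for a BLAS-style API. */
typedef void (*matvec_fn)(size_t n, const double *A,
                          const double *y, double *x);

/* Plain reference implementation. */
void matvec_reference(size_t n, const double *A,
                      const double *y, double *x)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * y[j];
        x[i] = sum;
    }
}

/* "Tuned" variant: inner loop unrolled by two.  Same mathematical
 * result, different code path (the summation order differs, so
 * rounding may differ slightly for general input). */
void matvec_unrolled(size_t n, const double *A,
                     const double *y, double *x)
{
    for (size_t i = 0; i < n; i++) {
        const double *row = A + i * n;
        double s0 = 0.0, s1 = 0.0;
        size_t j = 0;
        for (; j + 1 < n; j += 2) {
            s0 += row[j] * y[j];
            s1 += row[j + 1] * y[j + 1];
        }
        if (j < n)                 /* odd n: one element left over */
            s0 += row[j] * y[j];
        x[i] = s0 + s1;
    }
}

/* Application code: written once against the interface, unaware of
 * which implementation it is handed.  Computes x = A*(A*y) using tmp
 * as scratch space of length n. */
void apply_twice(matvec_fn mv, size_t n, const double *A,
                 const double *y, double *tmp, double *x)
{
    mv(n, A, y, tmp);
    mv(n, A, tmp, x);
}
```

Calling apply_twice(matvec_reference, ...) or
apply_twice(matvec_unrolled, ...) leaves the application code untouched,
which is the property a library-level API buys you.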

The real question is then, what SHOULD be done by the compiler and what
SHOULD be done by the programmer with libraries, not just in a parallel
environment but in any environment? C has always kept the compiler
itself minimal and simple.  Even math (at one time "intrinsic" to
fortran) is >>linked<< as a C library, because there are actually good
ways and bad ways to code something as simple as sin(x), and it doesn't
make sense to have to completely change compilers to replace the
operational function calls.  Imagine buying a compiler with intrinsic
BLAS if you had to buy a different revision for each hardware
combination (to get ATLAS-like tuning).  Oooo, expensive.
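
As a small illustration of "good ways and bad ways" to code sin(x), the
sketch below compares a bare Taylor series with the same series applied
after range reduction.  The names sin_taylor and sin_reduced are made up
for this example, and neither is how a production math library does it;
the point is only that the choice of algorithm matters and belongs in
replaceable library code.

```c
/* Taylor series for sin(x) about 0 using a fixed number of terms.
 * Accurate near 0, useless for large |x|. */
double sin_taylor(double x, int nterms)
{
    double term = x;   /* current signed term x^(2k+1)/(2k+1)! */
    double sum = x;
    for (int k = 1; k < nterms; k++) {
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}

/* Same series, but first reduce x to roughly [-pi, pi] by subtracting
 * the nearest multiple of 2*pi.  Crude reduction: fine for moderate
 * |x|, loses accuracy for very large arguments. */
double sin_reduced(double x, int nterms)
{
    const double two_pi = 6.283185307179586;
    long k = (long)(x / two_pi + (x >= 0.0 ? 0.5 : -0.5));
    return sin_taylor(x - (double)k * two_pi, nterms);
}
```

With a dozen terms, sin_reduced(30.0, 12) lands close to the true value
of about -0.988, while sin_taylor(30.0, 12) is off by many orders of
magnitude.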

There it gets down to the hardware.  If the hardware supports just one
best way of doing something like evaluating e.g. sin(x) or doing a x =
Ay, then writing a compiler to support it as an elementary operation
makes sense.  Just in the x86 family's evolution, however, I've watched
8088 (8 bit bus) give way to 8086 (16 bit bus), 8086 give way to 8086+8087,
8086+8087 give way to 486 (unifying the command operations) and on to P5
and P6's, just to indicate a single architecture, where I used fortran
compilers on this lot at least sometimes up to just about the 486.

Well, the original 8088/8086 fortran just ignored the 8087, and
replacements were expensive and slow to arrive.  One had to hand code
assembler subroutine libraries to replace things like sin(x) in order to
experience about a tenfold speedup in numerical code (to a whopping oh,
100 Kflops:-).  I wrote a bunch of them, then fortran started to
directly support the 8087 instructions, then I stopped using fortran and
never looked back.

The moral of this story is that the "parallel constructs" in fortran are
at least partly an illusion created by building certain classes of
optimizing library calls into the compiler itself, which is a generally
dangerous thing to do and also expensive -- requires constant
maintenance and retuning (which is partly what you pay for with
compilers that do it, I suppose).  For some, the performance boost you
can get without rewriting your code is worth it (where the "without
rewriting your code" is a critical feature, as the most common reason I
hear for people to request fortran support is "I have this huge code
base already written in fortran and don't want to port", not "I just
love fortran and all its constructs":-).  If you DO have to rewrite your
code anyway, then the thinness of the C interface provides a clear
advantage that interpolates between the non-portability of naked
assembler and the convenience of x = Ay constructs at the compiler
level, and because
you will be "forced" to use libraries even for simple math, you'll be
forced to confront library efficiency and algorithm.  You'll probably do
better on a rewrite than you ever would with a "parallelizing compiler"
and no rewrite, which was Gerry's original point.

This is why I don't think there is really much difference between the
major procedural or OO compilers for the purposes of writing parallel or
most other code (lisp, apl etc excepted).  They all have loops,
conditionals, subroutine and function calls.  They all support a fairly
wide and remarkably similar range of data types that may or may not
successfully create a layer of abstraction for the programmer depending
on how religious the programmer and the compiler are about rules (Wanna
access your double array in C as a linear char string?  Sure, no
problem...:-).  Some folks like a more disciplined compiler that spanks
them should they try this, or forces them to access their data objects
only through a "protected" interface via a method.  Others like to live
dangerously and have access to the raw data whenever they like, for good
or evil purpose.  But this is just a matter of personal preference and
educational history, no matter what the religious zealots of both sides
would claim, and not worth fighting about.
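
The "double array as a linear char string" trick looks something like
the following sketch.  The helper names dump_bytes and poke_byte are
invented for illustration.  Reading an object through an unsigned char
pointer is something C explicitly allows; what the individual bytes mean
depends on the platform's floating point format and byte order.

```c
#include <stddef.h>

/* Copy the raw bytes of a double array into out, which must hold
 * n * sizeof(double) bytes, by walking the array through an
 * unsigned char pointer. */
void dump_bytes(const double *a, size_t n, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)a;
    for (size_t i = 0; i < n * sizeof(double); i++)
        out[i] = p[i];
}

/* The dangerous direction: overwrite one byte of element idx.
 * Nothing in the language stops this; the resulting double value is
 * whatever that bit pattern happens to mean on this machine. */
void poke_byte(double *a, size_t idx, size_t byte, unsigned char value)
{
    unsigned char *p = (unsigned char *)&a[idx];
    p[byte] = value;
}
```

A stricter language would refuse to compile the second function, or
make you go through an accessor method; C just does what it is told.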

In parallel programming in particular (yes, this post IS relevant to
beowulfery, yes it is:-) this issue is of extreme importance.  A true
"parallel compiler" (in my opinion) would be something that could be fed
very simple constructs (such as x = A*y in code) that would spit out a
distributable executable that one could then "execute" with a single
statement and have it run, in automagic parallel, on your particular
parallel environment.  So far, I don't think there has been a dazzling
degree of success with this concept even with dedicated parallel
hardware -- it ends up being done with library calls and not by the
compiler even then, and even then it doesn't always do it very WELL
without a lot of work.

Compared to dedicated hardware, beowulfs can be architected in many
ways, with lots of combinations of hardware, memory cache, network
speeds and types, latencies -- dazzlingly complex.  Even with a true
"beowulf operating system", a flat pid space, a single executable line
head node interface (such as Scyld is building) writing a parallel
>>compiler<< would be awe inspiringly difficult.  Much simpler to leave
the parallelization to either the user (via library calls) in a message
passing model or at worst foist the problem off on the operating system
by creating one of the distributed shared memory interfaces -- CC-NUMA
or the like -- that hides IPC details from even the compiler and

I won't say it'll never happen, only that I don't THINK that it'll ever
happen.  Things change too quickly for it to even be a good idea, at
least using today's COTS hardware.  Until then, I think that all wise
programmers will pretty much ignore statements like "fortran compilers
can parallelize better than _____ compilers" -- compilers don't, or
shouldn't, parallelize at all, and code written for a serial system
should almost certainly be REwritten to run in parallel, at least if you
care enough about speedup that you bother to get a parallel machine in
the first place.  At the library level, you aren't comparing compilers,
you're comparing libraries, and may even be able to use the same library
in multiple compilers.
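
To make "rewritten to run in parallel" a little more concrete, here is a
sketch of the kind of decomposition the programmer ends up writing by
hand: x = Ay split into blocks of rows, one block per worker.  The names
block_range, matvec_rows and matvec_decomposed are hypothetical, and the
workers are simulated by an ordinary serial loop so that the example
stands alone; on a real cluster each block would run in its own process
and the pieces of x would be collected with PVM, MPI or socket calls.

```c
#include <stddef.h>

/* Rows [*lo, *hi) of an n-row problem owned by worker `rank` out of
 * `nworkers`, spreading any remainder over the first workers. */
void block_range(size_t n, size_t nworkers, size_t rank,
                 size_t *lo, size_t *hi)
{
    size_t base = n / nworkers, extra = n % nworkers;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* The work one worker would do: its own rows of x = A*y, where A is
 * a row-major n-by-n matrix. */
void matvec_rows(size_t n, size_t lo, size_t hi,
                 const double *A, const double *y, double *x)
{
    for (size_t i = lo; i < hi; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * y[j];
        x[i] = sum;
    }
}

/* Serial stand-in for the parallel run: visit each worker in turn.
 * On a cluster each iteration would be a separate process, and the
 * pieces of x would be gathered by message passing. */
void matvec_decomposed(size_t n, size_t nworkers,
                       const double *A, const double *y, double *x)
{
    for (size_t rank = 0; rank < nworkers; rank++) {
        size_t lo, hi;
        block_range(n, nworkers, rank, &lo, &hi);
        matvec_rows(n, lo, hi, A, y, x);
    }
}
```

None of this depends on the compiler; the decisions that matter (who
owns which rows, what gets communicated and when) are made in the code
and in the library calls.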

So let's be very careful, in our religious wars concerning compilers
suitable for working with parallel computers, beowulfs in particular, to
differentiate between "true" differences between compilers -- ways they
do things that are fundamentally different and relevant to
parallelization -- and their irrelevant syntactical differences, e.g.
x**y vs pow(x,y) or {} vs do end.  Let us also be sure to leave out the
equally irrelevant issues of whether or not objects, protection and
inheritance, classes and so forth are good or bad things -- you may like
them, I may not, and so what does that have to do with parallelization?

As far as parallelization is concerned, PVM or MPI or sockets are PVM or
MPI or sockets, in fortran or in c or in c++.  All that changes is the
syntax and call methodology of the API, and even that doesn't change
much.  That there might be trivial advantages here, or disadvantages
there, for particular problems, comes as no surprise.  It is a GOOD thing
that these are NOT features of the compiler, and a BAD thing to suggest
to potential newbie parallel programmers that they "must" use one
compiler or another to write good parallel code or to suggest that one
compiler "parallelizes" code at all, let alone better than another.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
