[Beowulf] g77 limits...

Thu Feb 23 21:55:27 PST 2006

At 08:11 PM 2/23/2006, Robert G. Brown wrote:
>O
>>Next thing you know, you'll be advocating that we can split up big 
>>problems into lots of little ones and use a room full of these toy things 
>>called PCs to do the calculations.
>
>You should talk -- you're the one who wants to run them on a stack of
>these toy things called "toy things" -- like gameboy advanced cpus or
>PDAs -- right?  ;-)
>
>But only if they can be programmed in C....

go FORTH forever, no?

>>True enough.  But there might be a limit to this.  Consider the gyrations 
>>that any language goes through to provide a simple representation of 
>>certain operations that are efficient on the target computer. A good 
>>example would be FFT butterflies on a DSP oriented CPU (with hardware 
>>that supports stride indexing and in-place bitreversed addressing), or 
>>even certain vectorizable computations (filters with long sequences of 
>>multiply and accumulate).
>
>I understand this.  Indeed, for a long time that was fortran's real
>strength -- it ran super-well on a vector machine because all it did was
>arithmetic on arrays that had to be defined fortran style, that is to
>say statically for most practical purposes.  You could optimize the hell
>out of this at the compiler level, and C often sucked in comparison on
>this kind of operation.
>
>On a non-vector machine the advantage wasn't terribly pronounced, and
>then improvements in C started to match the improvements in CPU design
>-- SSE support, etc.  At this point you probably have MORE control over
>core loop design in C than in fortran -- at least a better idea of what
>is really going on on the CPU in any given code block, since C is that
>"thin veneer of upper level language sensibility on top of raw
>assembler" kinda thing.  But I wouldn't be surprised if DSPs (being
>heavily vector arithmetic kinds of things) "like" fortran-like
>contructs.

You bet... that's what a DSP excels at.. something moderately repetitive, 
where you can explicitly manage the pipeline and/or multiple 
ALUs.  Although, because of the extreme investment in the main-line 
consumer processors (especially these days), you can get interesting 
anomalies.  There was a time when you could actually do FFTs faster on a 
68010 (state of the art 16 bit micro) than on a TMS320 DSP, especially if 
you counted the time to shuffle the data off to the DSP and get it back to 
the host.  However, even in those situations, the DSP will usually blow the 
doors off the general purpose processor when it comes to things like power 
consumption, or deterministic timing (although the latter is often a 
function of the type of software for the general purpose machine).

>I actually have given a bit of thought to the notion of designing a
>multidimensional system memory -- one with a triple addessing system
>that actually has intrinsic ability to deliver memory stripes in all
>three directions, integrated with a tripled memory subsystem that feeds
>three caches into a CPU with tripled CPU pipelines.  Do vector
>calculations involving NN arithmetic in three dimensions!  For some
>classes of problems this would produce a really, really significant
>performance boost, I think.  In fact, since there are some problems
>(lattice gauge theory, for example) that are typically computed on a 4
>(or 4+1) dimensional lattice, going one or two dimensions higher might
>even be worthwhile.

I think these sorts of things actually exist already.  If you look at the 
processing done in Synthetic Aperture Radar, there's an operation called 
"corner turning" where you write by columns and read by rows.  Modern 
sophisticated radars do what's called Space Time Adaptive Processing (STAP) 
where you essentially form a series of sums over a 3D cube of data 
(multiple frequency bands, multiple time samples, multiple elements in the 
array)

>So let's see, this would only cost what, 4 to 6 billion dollars for a
>foundry?  Or is this estimate light?  Clearly worth it.

Way high.  You can get your very own custom IC made for around $5K (check 
out http://www.mosis.org/).  Yours might be a bit bigger than the minimum 
size, and you might need a faster process.  Figure about $1M for the first 
spin on a "real" ASIC (say, a million gates worth) and maybe $250-300K for 
subsequent spins.  The per unit cost when you're in production is much, 
much lower (you're basically buying highly processed sand, after all). But 
maybe what you want to do is doable in a FPGA, and then you're in the few 
hundred dollars range (and there's gnu tools to help).. Not like you're 
actually implementing the memory, but more of a sophsticated memory address 
translator scheme, and that would actually be pretty easy in a FPGA.

This is what the folks at D.E.Shaw Research are doing, from what I've heard.

>>Actually, I find it amazing that there's still an awful lot of brand new 
>>FORTRAN code being written, considering that the language was invented to 
>>solve the problems that were current in the late 50s (and had research 
>>budgets to pay for them), and has all these clunky aspects.
>
>Absolutely in-credible.  In so many ways.
>
>
>I personally think that they must have done SOMETHING that stimulated
>the addiction centers in their brains while coding back when they were
>growing up.  They get some sort of dopamine rush from it.  Or else it is
>like tattooing yourself in primitive cultures to prove your manhood,
>body piercing.

Naahh.. it's that they're focussed on their particular problem, and the 
relative efficiency or inefficiency of the coding process isn't a big part 
of the overall effort.  They're not generating transaction processing 
systems for Visa, or scanning through zillions of phone calls looking for 
interesting data, etc.

>>And yes, while it's clunky, FORTRAN is no worse than, say, assembler, for
>
>ROTFL:-) This one line deserves to be written 10^22 times in a do loop
>by all those ashamed programmers.  One that uses the 10 CONTINUE form,
>nested however many times it takes to do 10^22 iterations with INTEGER
>types.

Well.. back in the IBM 1130 days, Integers were 2 bytes, 16 bits, so you 
could get about 2**16 iterations.  That's roughly 6.5E4. Takes about 5 
nested loops to do it.

C234567890
C       COUNT FROM ONE GREATER THAN -32768 TO ONE LESS THAN 32767 TO MAKE SURE
C       WE DON'T RUN INTO PROBLEMS WITH WHERE THE DO LOOP COMPARISON IS DONE
C
       DO 100 I1=-32767,32766
       DO 100 I2=-32767,32766
       DO 100 I3=-32767,32766
       DO 100 I4=-32767,32766
       DO 100 I5=-32767,32766
100   CONTINUE

>DeSmet C on the good old IBM PC back in maybe 1984.  Cost $100 and was
>worth every penny, and by 1986 or 87 I was using Unix and the real
>thing, with jove as an editor and everything.  The fortran I've
>written/read post 1986 can be reduced to -- well, not QUITE none, but
>none willingly, and mostly in the context of helping hapless students or
>postdocs with programs they for whatever crazed reason wrote in fortran,
>or trying to convert Dark Evil (like the original diehard program) into
>C.

A standard program for doing Method of Moments analysis of antennas is the 
Numerical Electromagnetics Code, version 4. (called NEC4)  It's in 
FORTRAN.  Several people have made attempts to do it in C (including a 
remarkably lame effort using f2c), but it endures, and you too can own a 
copy (assuming you're a US Citizen and promise not to distribute it to 
foreigners without the appropriate export license) for a few hundred bucks 
from Lawrence Livermore.  The older version (NEC2) is widely available as 
source code off the net.

Just too much work to convert it to something else, and, it works just fine 
in FORTRAN, and there's no need to "improve" it.  I believe there's even 
some folks that have "cluster"ized it (mostly by replacing library calls to 
the matrix routines).  The basic computation kernel isn't all that exotic, 
so FORTRAN works ok, and is efficient.  There is a whole indutry of making 
fancier front ends (to specify the model and interpret the results), and 
those tend to be in other languages.

Jim