[Beowulf] Moore's Law is dying

Joe Landman landman at scalableinformatics.com
Tue Apr 14 14:55:09 PDT 2009


Jon Forrest wrote:
> Joe Landman wrote:
> 
>> ... so I see you have never used an interprocedural analysis (-ipa) 
>> switch :)
>>
>> Allows you to do things like, I dunno, inline one whole routine inside 
>> another ...
> 
> I've never used this but from your description I don't
> see how it leads to larger text sizes at runtime. After all, if you have
> routine A which is 10 bytes, and routine B which is 20 bytes,
> it would seem that they collectively take 30 bytes no matter
> if they stand alone or one inside the other. I might not
> be understanding this right, though.

More like N*20 bytes ... use the routine in N places and inlining copies 
those 20 bytes into every call site :)
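
A quick hedged sketch of that (my code, not anything from the thread): 
with aggressive inlining (-O3, or an -ipa/-ipo style interprocedural 
pass, depending on the compiler), the body of the small helper below can 
be copied into each of its three callers, so its instructions land in 
the text segment three times instead of once plus three calls.

/* inline_text.c -- hedged illustration of how inlining grows text size.
 * With aggressive inlining, the body of saxpy_elem() can be duplicated
 * into each of the three kernels below, so its instructions appear three
 * times in the final text segment instead of once.
 */
static inline double saxpy_elem(double a, double x, double y)
{
    return a * x + y;                /* tiny body, cheap to duplicate */
}

void kernel1(double a, double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = saxpy_elem(a, x[i], y[i]);
}

void kernel2(double a, double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = saxpy_elem(a, x[i], y[i]) + 1.0;
}

void kernel3(double a, double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = saxpy_elem(a, x[i], y[i]) * 2.0;
}

Comparing "size" output for builds with and without inlining should show 
the text segment growing; the exact flags vary by compiler.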

> 
>> Usually leads to much larger program text sizes.
>>
>> This said, I have seen very large programs from RISC days hitting well 
>> more than 1 GB of text.  I haven't played with any recently though.
> 
> Let's say this is about right. Do you see such programs getting
> even larger in the future?

Sadly, yes.

>>> Why is sharing expensive in performance? It might take a little
>>> overhead to set up and manage, but why is having multiple virtual
>>> addresses map to the same physical memory expensive?
>>
>> Contention.  Memory hot spots.  Been there, done that.  We are about 
>> to do this all over again (collectively).
> 
> Naively I would think that text memory hot spots would be a good
> thing, because then all the benefits of caching would kick in.
> There would be no cache coherence overhead since text is read-only.
> Why is this a bad thing?

Ohhhh.... You *really* don't want your system brought to its knees over 
false sharing.  It's a great way to turn a large, expensive machine into 
a very slow large, expensive machine.  Listen to Greg Lindahl, and he'll 
likely point to this as one of the great fallacies of 'why shared memory 
is better' than distributed memory :) (not putting words in his mouth, 
so if he has changed his mind or thinks differently ... that's OK)

Imagine you are a processor, and you have written to a location in RAM. 
  So now your cache line is dirty, waiting in the queue to be flushed 
out.  In your parallel program, along comes someone else who really, 
really wants to read that cache line.  Ok, so this forces you to a) 
flush it now, and b) mark that line as clean.  Then the next CPU gets 
that cache line, does its write, and whammo, some other CPU wants to do 
the same thing to it that you did.
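
For what it's worth, here is a minimal OpenMP sketch of that ping-pong 
(hypothetical code, mine, not from the post): each thread touches only 
its own counter, yet the packed layout puts several counters on one 
cache line, so the line bounces between caches on every increment; 
padding each counter out to its own line makes the contention vanish.

/* false_share.c -- illustrative only; e.g. gcc -O2 -fopenmp false_share.c
 * Each thread increments only its own counter, so there is no logical
 * sharing, but the packed counters sit several to a 64-byte cache line,
 * so every increment bounces that line between cores (false sharing).
 * The padded version gives each counter its own line.  volatile keeps
 * the per-iteration stores in the loop.
 */
#include <omp.h>
#include <stdio.h>

#define ITERS 100000000L
#define MAXTHREADS 64

volatile long packed[MAXTHREADS];                 /* ~8 counters per line */

struct padded { volatile long v; char pad[64 - sizeof(long)]; };
struct padded alone[MAXTHREADS];                  /* one counter per line */

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            packed[tid]++;                        /* line ping-pongs */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            alone[tid].v++;                       /* private line */
    }
    double t2 = omp_get_wtime();

    printf("packed: %.2fs   padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}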

Sadly enough, this is a common programming error in shared memory 
programming.  Think of it as a bunch of loops operating in parallel, 
all trying to update the same counter at once.  In parallel.

Each update has to wait until it can grab the cache line, and then it 
proceeds.  The more updaters you have, the more contention for that 
resource you have.  Your performance scales as 1/N rather than (constant)*N.
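
To put a number on that 1/N, a hypothetical OpenMP sketch (again mine, 
not from the post): all threads bumping one shared counter through an 
atomic, versus a reduction that gives each thread a private copy and 
merges them once at the end.

/* counter_contention.c -- illustrative sketch; gcc -O2 -fopenmp
 * Version 1: every thread atomically bumps one shared counter, so all
 * updates serialize on the same cache line and throughput falls roughly
 * as 1/N in the thread count.
 * Version 2: the reduction gives each thread a private copy, combined
 * once at the end, so the threads barely interact.
 */
#include <omp.h>
#include <stdio.h>

#define ITERS 50000000L

int main(void)
{
    long shared_count = 0, reduced_count = 0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < ITERS; i++) {
        #pragma omp atomic
        shared_count++;               /* everyone fights for one line */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel for reduction(+:reduced_count)
    for (long i = 0; i < ITERS; i++)
        reduced_count++;              /* private copy, merged once */
    double t2 = omp_get_wtime();

    printf("shared atomic: %.2fs   reduction: %.2fs   (%ld / %ld)\n",
           t1 - t0, t2 - t1, shared_count, reduced_count);
    return 0;
}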

Now do this with a page at a time, say a buffer.  Like, I dunno, an 
InfiniBand MPI buffer, or a 10 GbE MPI buffer.  Throw more CPUs behind 
this buffer, and force them to get in line to shoot data over to their 
counterparts.  The IB or 10 GbE resource becomes contended for, and as 
you increase Ncpu, the contention and performance loss get worse and 
worse (this is basically what Doug Eadline is worried about).
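
A hedged way to watch that happen (a toy benchmark of my own, not Doug's 
numbers): put N ranks on each of two nodes, pair every rank on node A 
with one on node B, and have them all send at once.  All N streams 
funnel through the node's one HCA or NIC, so per-rank bandwidth drops 
as N rises.

/* nic_contention.c -- illustrative sketch; build with mpicc.
 * Run across two nodes with the first half of the ranks on node A and
 * the second half on node B (how you get that mapping depends on your
 * MPI launcher).  Ranks 0..N-1 each stream large messages to a partner
 * on the other node, so all N streams share one IB HCA / 10 GbE port.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (8 * 1024 * 1024)
#define REPS      50

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);
    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        if (rank < half)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double dt = MPI_Wtime() - t0;
    if (rank < half)
        printf("rank %d: %.1f MB/s\n", rank,
               (double)REPS * MSG_BYTES / dt / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}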

There are ways you can work around some of this stuff.  Sharing nothing 
is one way, though this is hard to do at an OS level where you share IO 
devices etc.  Another is to allocate private memory queues, a scheduler, 
and other bits per worker (you have to do this with CUDA systems and 
most accelerators to get reasonable performance).
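
One hedged reading of "private memory queues" (my sketch, not Joe's or 
any vendor's code): give every worker its own cache-line-aligned 
single-producer/single-consumer ring, so no two producers ever write the 
same line; only the paired consumer shares each ring, and even then the 
head and tail live on separate lines.

/* spsc_ring.c -- sketch of a share-little layout; C11 atomics, gcc -O2.
 * One ring per worker: the producer writes only its own ring, a single
 * consumer drains it, and head/tail sit on separate cache lines so the
 * two sides do not false-share either.
 */
#include <stdatomic.h>
#include <stdalign.h>
#include <stddef.h>

#define RING 1024                      /* slots per worker, power of two */

struct ring {
    alignas(64) _Atomic size_t head;   /* written by producer only */
    alignas(64) _Atomic size_t tail;   /* written by consumer only */
    alignas(64) long slot[RING];
};

/* producer side: returns 0 if the ring is full */
int ring_push(struct ring *r, long v)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING)
        return 0;                      /* full: caller retries later */
    r->slot[h % RING] = v;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}

/* consumer side: returns 0 if the ring is empty */
int ring_pop(struct ring *r, long *v)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h)
        return 0;                      /* empty */
    *v = r->slot[t % RING];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

Each worker posts into its own ring and a single drainer walks them all, 
which is roughly the shape of the private staging you end up building 
for accelerators anyway.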

I know you might postulate that the 32 bit text limit is effectively the 
CS equivalent of "c" in physics ... you may approach it asymptotically, 
but never actually get there ... but unlike in physics, there isn't 
really an underlying reason why you can't get there.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


