[Beowulf] Demo-cluster ideas

Mon Feb 8 05:39:44 PST 2016

On 2/5/16, 9:58 AM, "Douglas Eadline" <deadline at eadline.org> wrote:

>
>> What I find interesting about this is that there's only a 3:1 difference
>> between high and low.
>>
>> That's a pretty compelling argument that if you need a 10x speedup,
>> you're not going to get it by "buy a faster computer", and that, rather,
>> parallelism is your friend.
>
>And the clocks go from 2.5 to 3.2 GHz.

I'm not sure how closely CPU clock rate correlates with computational
performance these days.  It's all about architecture and things like
memory bandwidth.

It used to be that 2x clock gave you 2x MIPS.  Today I view any stated
clock rates as essentially part of the part number to allow you to
distinguish between models.  While it's not as bad as "peak music watts"
in the pre-FTC audio amplifier days, I think all the clock rate tells you
is that *for the same chip architecture more GHz is faster than less GHz*
and that's about it.
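
As a rough check on the figures quoted in this thread (a 3:1 spread between
high and low, and clocks running from 2.5 to 3.2 GHz), the sketch below
compares the two ratios.  It assumes the 3:1 figure is a ratio of measured
performance and that the fastest result came from the fastest clock; neither
assumption is stated in the thread, and the variable names are made up for
illustration.

```python
# Figures quoted in the thread; their interpretation is an assumption.
PERF_RATIO = 3.0      # "3:1 difference between high and low"
CLOCK_LOW_GHZ = 2.5   # slowest clock mentioned
CLOCK_HIGH_GHZ = 3.2  # fastest clock mentioned

# How much of the spread clock rate alone could account for.
clock_ratio = CLOCK_HIGH_GHZ / CLOCK_LOW_GHZ

# What is left over must come from something other than clock rate
# (architecture, memory bandwidth, core count, and so on).
non_clock_factor = PERF_RATIO / clock_ratio

print(f"clock ratio:      {clock_ratio:.2f}x")
print(f"non-clock factor: {non_clock_factor:.2f}x")
```

Under those assumptions the clock accounts for well under half of the
spread, which fits the view that clock rate only ranks parts within one
architecture.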

You're certainly not laying out a PWB to run at 3 GHz.

>
>I'm not sure how much farther multi-core can go
>with adding cores.

Well, multi-core is just parallel/cluster computing on a smaller scale
with more cross-resource coupling - maybe more like a crossbar switch to
memory.

For the vast majority of software out there (e.g. people running Excel,
most PC apps, etc.) multi-core seems to just be a way to spread threads
across multiple CPUs, perhaps saving some context switch time and keeping
the memory interface full - they're all still hitting the same big RAM,
network, and disk drive.

On my desktop and notebook computers, it seems the programs that
consistently suck up a whole core are the virus scanner and the disk
directory indexing tools - neither of these would be an issue in HPC, I
suspect.

To go farther, software will have to undergo a significant change in
architecture to one that is more amenable to hardware architectures more
like a classic Beowulf: standalone nodes with some communication fabric.
There's still the problem that making RAM is very different from making
CPUs - the chip design is fundamentally different.

This is the problem that Tilera/EZ-Chip or Angstrom face.

The TILE-Gx72 has 72 cores, but only 4 DDR memory ports.  There's 22 Mbyte
of on-chip cache - spreading that out across all the cores means that one
core really has about 1/3 Mbyte for its share of the cache.  So you need a
task that is pretty fine grained to take advantage.  Sure, you're not in
the vector pipeline/parallel SIMD world, but each CPU still has to have a
very high locality of data reference or the system becomes memory bound.
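
The per-core arithmetic behind those TILE-Gx72 numbers can be sketched as
follows.  This is only a back-of-the-envelope calculation: it assumes the
cache and the DDR ports are shared evenly across cores, which is an
idealization, and the helper name is hypothetical.

```python
# Figures as quoted above for the TILE-Gx72.
CORES = 72
DDR_PORTS = 4
CACHE_MBYTE = 22.0  # total on-chip cache


def per_core_share(total, cores):
    """Even split of a shared resource across cores (an idealization)."""
    return total / cores


cache_per_core_mbyte = per_core_share(CACHE_MBYTE, CORES)
cores_per_ddr_port = CORES / DDR_PORTS

print(f"cache per core:     {cache_per_core_mbyte:.3f} Mbyte")
print(f"cores per DDR port: {cores_per_ddr_port:.0f}")
```

With 18 cores contending for each memory port, a task whose working set
does not fit in roughly a third of a Mbyte will spend its time waiting on
memory rather than computing.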