[Beowulf] Win64 Clusters!!!!!!!!!!!!

Tue Apr 10 07:51:17 PDT 2007

On Sun, 8 Apr 2007, Joe Landman wrote:

>> 64-bit computing solves a real problem. For apps that
>> don't need the extra address space, the benefits of
>> the additional registers in x86-64 are nearly undone
>> by the need to move more bits around, so 32-bit
>> and 64-bit modes are pretty much a push. When you
>
> I would love to see your data for this.  Please note that I have quite a
> bit of data that contradicts this assertion (e.g. directly measured
> performance data, wall clock specifically of identical programs
> compiled to run in 32 and 64 bit mode on the same physical machine,
> running identical input decks).  This is older data, from 2004.  c.f.
> http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_SI_rlsdWP1.0_.pdf
> but it is still relevant, and specifically, directly addresses the
> assertions.
>
>> add the additional difficulty of getting 64-bit drivers
>> and what-not, I don't think it's worth messing with 64-bit
>> computing for apps that don't need the address space.

...

>> One additional way 64-bit computing is being oversold
>> is that there aren't now, and maybe never will be, any
>> human written program that requires more than 32 bits
>> for the instruction segment of the program. It's simply
>
> This is a bold assertion.  Sort of like the "no program will ever use
> more than 640k of memory" made by a computing luminary many moons ago.
>
>> too complex for a human, or a group of humans, to write
>> this much code. Again, note that this says nothing

I totally agree with Joe on this issue.  The "ideal" computer would have
an infinite, flat address space, totally transparent to the user.  Want
to address memory location FF 0A BB 79 C3 12 93 54 6A 19 1D DA? (or
simply have 2^90 \approx 10^27 data objects to manage)?  The memory
should be there, flat, transparent.

Further, the "ideal" computer has a discretized binary representation of
floating point numbers that is as close as possible to the real numbers
they approximate for a variety of excellent numerical reasons.  I
remember reading in any number of places how single precision floating
point numbers were perfectly adequate for doing any sort of meaningful
computation.  I remember learning the hard way just how wrong this
assertion is -- how much using double precision improves a long-running
numerical computation both by slowing the rate of accumulation of the
inevitable round-off errors and by admitting much larger exponents
without having to manage them "by hand".  I remember the joy of
discovering IEEE 80 bit arithmetic in the venerable 8087, with more
precision even than double.  I remember how much FASTER native 80 bit
arithmetic and then truncating to doubles is compared to doing double
precision using library routines on top of an 8-bit or 16-bit or even
32-bit CPU.
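
To make the round-off point concrete, here is a minimal Python sketch.
It is an illustration under stated assumptions, not a benchmark: single
precision is emulated by rounding through struct's 4-byte "f" format
after every operation, the step of 0.1 and the count of one million
additions are arbitrary choices, and the names to_single, accumulate and
roundoff_errors are made up for this example.

```python
import struct


def to_single(x):
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single=False):
    """Add `step` to a running total `n` times.

    With single=True both the step and the running total are rounded to
    single precision after every operation, mimicking 32-bit float math.
    """
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_single(total)
    return total


def roundoff_errors(step=0.1, n=1_000_000):
    """Return (single_error, double_error) relative to step * n."""
    reference = step * n  # one multiplication, so only one rounding
    err_single = abs(accumulate(step, n, single=True) - reference)
    err_double = abs(accumulate(step, n, single=False) - reference)
    return err_single, err_double


if __name__ == "__main__":
    err_single, err_double = roundoff_errors()
    print(f"single precision error: {err_single:.6g}")
    print(f"double precision error: {err_double:.6g}")
```

The single precision total drifts visibly away from the reference value
while the double precision total stays within a tiny fraction of it,
which is the slower accumulation of round-off error described above.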

>> about the data segment of a program. Also, people tell
>> me that there are programs that were generated by other
>> programs that are larger than 32 bits. I've never seen
>> one, but maybe they exist, and that's what I'm talking
>> about human written programs.

I don't understand how you could possibly imagine this to be true.  I do
numerical spin simulations on lattices in D dimensions.  An
N-dimensional spin (where N is not necessarily equal to D) is typically
represented by 1 to (N-1) real numbers (e.g. spherical polar angles).  In
addition any given spin may have other internal coordinates.  To
represent a spin therefore requires minimally order 4*N bytes for an
ordinary 32-bit float representation of the spin coordinates, more
likely 8*N bytes if one sensibly uses double precision coordinates.  For
3D spins say 24 bytes per site.

One then wishes to do simulations on the largest lattices possible.  The
constraint on lattice size is generally a mix of how much the memory can
hold and CPU speed, noting well that for cubic lattices the number of
sites scales like L^D where L is the cube length in units of
cartesian-indexed "sites".  A 32 bit machine can address at most 4 GB of
memory; in general purpose OS implementations this is generally reduced
by the requirements of running the OS itself and a VM system to 3 GB (at
least in a single data structure, without swapping).

Well, if I put my 24 byte spins on a 1000x1000x1000 lattice I'm already
up to 24 GB of memory.  If I'm working on D=4 spaces or D=5 spaces, then
a mere 100x100x100x100x100 lattice is 24x10^10 or 240 GB in size.  Here
the speed of doing arithmetic in 64 bits native AND the larger address
space of 64 bit machines are absolutely essential to even play the game.
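
The arithmetic above can be checked with a short Python sketch.  It
assumes decimal gigabytes (10^9 bytes), 8 bytes per real number, and the
3 GB usable limit quoted above for a 32 bit OS; the function names
(bytes_per_spin, lattice_bytes, max_side) are hypothetical and exist only
for this illustration.

```python
GB = 10 ** 9                 # decimal gigabyte, as used in the text
USABLE_32BIT = 3 * GB        # roughly what a 32 bit OS leaves to one process


def bytes_per_spin(n_components, bytes_per_real=8):
    """Storage for one spin: one real number per component."""
    return n_components * bytes_per_real


def lattice_bytes(side, dims, n_components=3, bytes_per_real=8):
    """Storage for a cubic lattice of side**dims sites."""
    return side ** dims * bytes_per_spin(n_components, bytes_per_real)


def max_side(dims, budget_bytes, n_components=3, bytes_per_real=8):
    """Largest side whose lattice still fits in budget_bytes (0 if none)."""
    side = 0
    while lattice_bytes(side + 1, dims, n_components, bytes_per_real) <= budget_bytes:
        side += 1
    return side


if __name__ == "__main__":
    for side, dims in [(1000, 3), (100, 5)]:
        size = lattice_bytes(side, dims)
        verdict = "fits" if size <= USABLE_32BIT else "does not fit"
        print(f"side {side}, D={dims}: {size / GB:.0f} GB, {verdict} in 3 GB")
    print("largest D=3 side within 3 GB:", max_side(3, USABLE_32BIT))
```

Under these assumptions the largest 3D lattice of 24 byte spins that
fits within 3 GB has a side of only 500 sites, and both lattices
mentioned above are far beyond anything a 32 bit address space can hold
in one piece.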

This isn't an isolated (if specific) example.  There is a vast range of
memory-size bound problems, some of which have modest CPU requirements
but an absolute necessity to be able to efficiently address large memory
spaces.  So much so that there have been cluster computing development
efforts that focus on building very large flat memory models at the
expense of computing speed -- the Trapeze project at Duke, for example.
Here the point isn't to do lots of computation in parallel -- the
application may even be single threaded.  The parallel computer exists
solely to provide the illusion of a vast reasonably flat memory space.
There are other groups in the physics department here who would
routinely buy 16+ GB machines (which obviously require 64 bit OS and
hardware) if only they could afford all that memory as their
computations easily scale out that far.  They generally can afford only
one or two "large memory" machines (which are still much more expensive
than 2-4 GB machines as the price premium on really large memory sticks
persists) but they'd LOVE to go large.

Personally I "wish" that they'd done the dual core thing entirely
differently.  Instead of having two completely independent 64 bit cores
per CPU, they might have built a 128-bit core with a hardware floating
point execution pathway that permitted it to be transparently broken
down into 4 32 bit parallel pathways, 2 64 bit pathways, 1 96 bit and 1
32 bit pathway, or 1 128 bit pathway, with entirely transparent flat
memory access out to 128 bits, and with hardware implementation of 128
bit integer or 128 bit floating point arithmetic (on down).  Leave it to
a mix of the CPU, the OS, the compiler, and the application to decide
how to pipeline and allocate the available ALUs, registers, cache lines,
etc. to the needs of the program.

But I'm not terribly worried.  This to some extent describes the cell
architecture, with some slop as to just where the ganging together of
smaller logic units into larger ones occurs.  And lots of very smart
people are working on this -- smarter than me for sure -- and doubtless
have far better ideas.  Stating that there is no need for 64 bit
architectures and that 32 bits is enough for anyone is basically
equivalent to stating "the systems engineers working for AMD and Intel
and IBM and Motorola are complete idiots".  This is simply not the case.

They aren't idiots, they are brilliant, and the simple fact of the
matter is that 64 bit systems are faster, smarter, bigger, better than
32 bit systems.  When AMD's Opteron was first released, it was noted
that it was the fastest >>32 bit<< architecture available at the time,
because it was in aggregate faster to do 32 bit arithmetic (especially
where a lot of that 32 bit arithmetic de facto involved 64 bit floats)
on a 64 bit machine than it was on a 32 bit machine.

I have watched from the days of the Z80 and 8088 (8 bit external bus, 20
bit segmented address space) through the 8086 (16 bit external bus, 20
bit segmented address space) through the 186 (very short-lived except as a
programmable device CPU), 286, 386, 486 (including the crippled SX),
Pentium, Pentium Pro, etc... with similar progress by AMD and nearly
forgotten Intel competitors (Cyrix?) and completely different
progression by Motorola with its FLAT 68000 memory space right on up to
the Opteron, the 64-bit Xeon, the Athlon 64.  With sundry side trips
into Sparc, MIPS, and other workstation CPU architectures, BTW.

The process has from the beginning been driven by a voracious
public eager to take advantage of bigger address spaces, faster
arithmetic and so on associated with larger data pathways.  I fully
expect to see 128 bit CPUs become a standard in the next decade, unless
the cell approach does indeed represent a paradigm shift away from the
notion of a "central" processing unit at all and we see instead
on-the-fly reconfigurable multiprocessing units that can gang together
to 128 bits (or even more) if that's what you need or can equally well
function as a cluster of N 32 bit "thread execution units", where the OS
kernel becomes basically a cluster operating system with a dynamic
"cluster" of processing and memory resources interconnected by what
amounts to a network.

    rgb

>
> I am sorry, but I think this may be an artificial strawman.
>
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu