
# [Beowulf] Win64 Clusters!!!!!!!!!!!!


Robert G. Brown rgb at phy.duke.edu
Thu Apr 12 06:35:02 PDT 2007

On Wed, 11 Apr 2007, Richard Walsh wrote:

>> Hear hear! For self-adapting software you *can't* distinguish
>> instructions from data. That may sound over-specialized but I invite
>> you to consider DNA and what it does: instruct enzymes to modify DNA.
>> An awful lot comes out of that process. So I don't think it's just von
>> Neumann's whim.
>
> On the opposite side of the argument, that code/instruction set
> (self-modified for say about a billion years [and recombined, and
> mutated]) is only ~3 billion base pairs in length. If we liberally give
> each base pair a byte, that's only 3 billion bytes. We can quibble about
> what an instruction is, but also about how much of the code is used and
> what would be the reduced instruction set equivalent.

There are two very important points here.  One is that Jon's assertion
in one sense is correct.  Even humans who type like the wind (immodestly
offering myself as an example as my friends tell me that I'm fairly
"prolific":-) can type at a >sustained rate< of only 2-4 characters a
second, if that.  1 gigasecond is roughly 10\pi \approx 32 years (old
rule of thumb: 1 year \approx \pi \times 10^7 seconds).
That means that a human being, typing 4 characters (bytes) per second,
would require 32 years of continual sustained effort to generate 2^32
bytes worth of production.  There is no question that no single human --
not even Isaac Asimov or some of the other truly prodigious authors --
has ever come close to writing the 4000-8000 full-length books (a
typical novel runs between half a megabyte and a megabyte of text)
required.
(Asimov wrote 500 books and an estimated 9000 letters and postcards,
according to Wikipedia, which would make him quite exceptional --
perhaps as much as a gigabyte over a career.)
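The arithmetic above is easy to sanity-check.  A minimal sketch, assuming
the rates quoted in the text (4 characters/second sustained, the
\pi \times 10^7 seconds-per-year rule of thumb, and a nominal 0.75 MB
per novel, my own round number within the half-to-one megabyte range):

```python
# Back-of-the-envelope check of the typing-rate argument.
import math

chars_per_second = 4                  # a fast, sustained human typist
seconds_per_year = math.pi * 1e7      # rule-of-thumb year length
target_bytes = 2**32                  # the 4 GB boundary

years_needed = target_bytes / (chars_per_second * seconds_per_year)
print(f"{years_needed:.1f} years of nonstop typing")   # about 34 years

# Same output measured in novels (~0.75 MB of text apiece, an assumption):
novels = target_bytes / (0.75 * 2**20)
print(f"{novels:.0f} novels")          # lands inside the 4000-8000 range
```

The result is consistent with the estimate in the text: a few decades of
uninterrupted typing, or several thousand full-length novels.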

Nobody could even THINK of sustaining this rate writing source code, and
even allowing for amplification via the usual encapsulation of code by a
compiler and associated libraries, it seems
rather likely that there are only a handful of humans who write 4 GB
worth of actual production binary code over a lifetime, not including of
course the many discarded transient versions as one builds and tests
(which can EASILY fill GB as one works, as one will rapidly learn if one
code-version controls these images at the many intermediate stages).  So
let us take it for granted that no single human has produced or is
likely to produce a single binary executable image that exceeds 4 GB in
size (except in the context of checkpointing and saving to disk a
runtime image of all of memory on a larger memory system or in the
context of creating VM images, both of which are RELEVANT exceptions as
they are actions of an upper level piece of software that captures a
working environment just as much as compiling and linking).  But the
basic point stands -- humans cannot type that fast, and so far "single
applications" (except for the aforementioned VM images of entire running
systems) do not link in aggregate resources that exceed 4 GB in length.
To gain a still better appreciation of this, note that an entire linux
distro, binaries, libraries and all, will often fit easily into a 4 GB
partition.

So it is a true statement BUT it is also IRRELEVANT to the entire issue
of CPU engineering.  CPUs have not been single threaded application
engines since DOS.  Really since BEFORE DOS, but DOS and friends were
basically single-tasking engines, and the early 808x CPUs were
engineered as such as those of us who coded them in assembler well
remember.  CS, SS, DS, ES (the last often unused) as 16-bit segment
registers, each shifted left four bits and combined with a 16-bit (64K)
offset to produce a 64K "code address space", "stack address space" or
"data address space", each with its own addressing and/or pointers to
manipulate.
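The real-mode address arithmetic described above can be sketched in a
few lines (a toy illustration; the helper name is mine, not anything
from the 808x toolchain):

```python
# 8086 real-mode addressing: a 16-bit segment register is shifted left
# 4 bits and added to a 16-bit offset, yielding a 20-bit physical
# address (1 MB total), with each segment spanning a 64K window.

def physical_address(segment: int, offset: int) -> int:
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    return ((segment << 4) + offset) & 0xFFFFF  # wraps at the 1 MB mark

# The same physical byte is reachable via many segment:offset pairs:
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350 again
```

That many-to-one aliasing is part of why programmers found the model
painful compared to a flat address space.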

On this kind of segmented-CPU, single-threaded kernel architecture one
could sensibly talk about "code addresses" as somehow distinct from
"data addresses", though in practice they often weren't -- IIRC one
generally set ES to SS and henceforth ignored it in writing actual code,
but I haven't written 808x code for 25 years and could be recalling
INcorrectly:-).

Programmers hated this, and lamented the fact that 808x wasn't a sweet,
flat memory model CPU like the (competing) Motorola 68000 used in the
Macintoshi of the day, but PCs were cheap, Macs were dear, and even in
DOS days a command line was a marvelous thing compared to HAVING to use
a mouse to accomplish every task.

A few processor generations later the 80x86 flattened memory (while
NATURALLY retaining segmented addressing capabilities so the CPUs could
still run DOS binaries -- I can STILL run DOS binaries for that matter
AFAIK, as Intel and AMD maintain backwards compatibility ad nauseam --
those mammals might NEED those gills one day, as ontogeny eternally
recapitulates phylogeny in evolutionary systems:-).  Not TOO long after
(say, a decade and change:-) true multitasking operating systems built
on flat memory models emerged, first for the 80386 (SunOS on the Sun
386i and a few other small Unices in the late 80's/early 90's), then
Linux and FreeBSD and even WinNT and OS/2 in the early to mid 90's.

Since the emergence of true multitasking kernels with true virtual
memory, true runtime shared/DL libraries, and automated buffering and
caching capabilities running on top of a flat-memory processor and
memory subsystem, there has been NO POINT in distinguishing the
ADDRESSING of "code" memory, "stack" memory, "heap" memory, "data"
memory, "cache" memory, "buffer" memory, or any of the various DMA or
memory-mapped kinds of device memory that take up addresses in the
generalized addressable memory space of a given system.  I think that is
what has confused this discussion from the start.  Jon made a true
observation, but it really had nothing to do with whether or not 64 bit
processors with larger flat memory spaces were a good thing or a bad
thing.  So people (myself included) jumped up to defend them because
they are a GOOD thing without question, even for cheap PCs, and not just
a matter of "marketing" 64 bits as being twice as big as 32 bits and
bigger must be better.  Bigger IS better, and faster is better still and
64 bit CPUs are both bigger and faster and enable whole new classes of
application to be written and run (including, of course, the large DATA
memory applications of interest to many on this list).

To make this point still clearer, let us return to the VM exception
above.  In one very fundamental sense, CPUs are STILL executing "a
single thread".  There is just one instruction pointer that tells the
CPU what instruction it will execute "next".  CPUs are still serial
execution units (even as they internally parallelize to do this or
that, that doesn't prevent this from being true).

What, then, is the "program" that they run?  >>EVERYTHING!<< The
"single-threaded" program running on my laptop is the sum total
contents of memory, with the kernel, not the hardware, in control of its
memory utilization.  Fragments of the full binary code image are
scattered here and there in main memory according to the whim and needs
of the kernel, interspersed with "stacks", "heaps", "data", "buffers",
"cache" and so on.  This "main memory" is further extended as virtual
memory into swap space and nonresident pages from the disks.  The
addressing scheme for "the program" running on a computer has to be able
to embrace ALL of this under the perfect control of the kernel.  A flat
memory model is clearly vastly simpler to program for this purpose than
a segmented memory model, and an unsegmented model that "required" all
code pages to be loaded in the first 4 GB of memory on large memory
systems would be downright evil as it breaks the kernel's symmetry and
forces a whole layer of additional conditional statements in its memory
management code.

Now THIS single-threaded "program", with its mix of dynamic and
relocatable memory images where the "purpose" of any given block of
memory can change from one nanosecond to the next, where data can become
code or vice versa (think of interpreted languages, think of debuggers,
think of buffer overwrite attacks), written not by any single programmer
but by tens of thousands of programmers with millions of FTE hours of
effort over decades, with its hundreds of thousands of interchangeable
subprograms (the tasks that the kernel is running for the user) in a
near-infinity of permutations is simultaneously one of the true wonders
of the world -- one of the crowning achievements of modern humanity,
quite literally -- and can EASILY exceed 4 GB in size on a system that
has more than 4 GB of memory.  In fact, the kernel will almost certainly
spread out and use all available memory but a tiny bit held in reserve
to satisfy the immediate needs for free memory blocks of the programs
themselves, the kernel, and startup of interactive tasks just to make
itself run more efficiently.
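The data-becomes-code point (interpreted languages, debuggers) can be
illustrated in a couple of lines.  A minimal sketch -- the generated
function here is hypothetical, just something for the interpreter to
compile:

```python
# A string is ordinary data in memory until the moment the interpreter
# compiles and executes it -- at which point those same bytes are code.
source = "def square(x):\n    return x * x\n"   # just bytes in a buffer

namespace = {}
code_obj = compile(source, "<generated>", "exec")  # data -> code object
exec(code_obj, namespace)                          # now it's executable

print(namespace["square"](7))  # 49
```

No "code address space" was consulted anywhere; the kernel (and the
interpreter above it) decides what any given block of memory means at
any given instant.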

64 bit CPUs are engineered to permit THIS program to exceed 4 GB in size
in a symmetric way that does not place any restrictions on what is or is
not "code memory".  The kernel is in control, and has a big, flat
"desktop" to work with, as it should be.  If it chooses to dub a segment of
memory past the 4 GB boundary as holding "code", it can and will.  If a
few moments later this memory becomes "data" that's fine as well.  It
can even be "code" one instruction and "data" a couple of instructions
later as one single-steps through a debugger located physically
somewhere completely different in memory while the kernel periodically
is interrupted to process device requirements with still different
chunks of code memory.

THIS is why folks get "upset" at any suggestion that 64 bit processors
are a bad thing and we could have done just as well extending "only the
data address space" as if that is somehow distinguished from "code
address space".  They are distinguished only by the kernel, and the
kernel AND ALL ITS SUBTASKS are "the program" that 64 bits enables to
grow past the 4 GB boundary.

I hope that is now clear.

rgb

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu